Processing files with awk
The awk
processing utility can practically be used as a programming
language -- but first you need to learn its simpler features. In
the first of two columns on awk, we show you how it breaks
records into fields and how to execute more than one set of
commands on a record.
Summary
Awk is a text processing utility that runs through a text file
by reading and processing a record at a time. We start with
the basics and move to executing more than one set of commands
with awk. We give you multiple examples of awk processes. (2,700
words)
There have been several requests for
information on awk, and I happen to like it as a utility, so
this column and next month's column will cover awk.
Awk is a flexible text processing utility that can be used
almost as a programming language. You can do a great deal with
awk once you learn just a few of its simple features.
Awk runs through a text file by reading and processing one
record at a time. Its commands are written with the intention that
they act repetitively on each record as it is read in. A
record that has been read by awk is broken into separate fields,
and actions can be performed on the separate fields as well as
on the whole record.
The actions or steps to be performed on the fields in each
record or on the whole record make up an awk program or an awk
script.
When you type awk as a command, you normally provide two
additional pieces of information, or arguments. The first is the
program or script to be executed, and the second identifies the
file on which to perform the actions. Awk can also be used in a
pipeline, in which case it reads the piped data and the file does
not need to be named explicitly on the command line.
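As a minimal sketch of the two ways of supplying input, the commands below run the same one-command program, {print} (explained in the next section), first on a named file and then on piped data. The sample file and its contents are made up for illustration, and mktemp is assumed to be available on your system.

```shell
# Hypothetical sample file; the name comes from mktemp, not from the article.
sample_file=$(mktemp)
printf 'first line\nsecond line\n' > "$sample_file"

# 1. Program first, then the file to process.
from_file=$(awk '{print}' "$sample_file")

# 2. No file named: awk reads whatever arrives through the pipe.
from_pipe=$(cat "$sample_file" | awk '{print}')

printf '%s\n' "$from_file"
printf '%s\n' "$from_pipe"
rm -f "$sample_file"
```

Both forms should print the same two lines.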
Starting with the basics
Let's start with a simple awk command in Figure 1 to get a
better idea of how it works.
Figure 1
ls -l|awk '{print}'
The output of the ls -l command has been piped into awk and
is the "file" to be processed. There is no need to
name a file in the awk portion of the command line. The awk
program or script is one command, {print}. This example doesn't
do much. It takes the whole record that was sent to awk and
prints it on the screen. This simple command does partially
illustrate the record-by-record action of awk. For each record
received by the awk program (each line of the output of the ls
-l command), the print instruction is executed. It is important
to remember this action by awk. Each record is read, then for
each record, the instructions in the program are executed.
The output of this program is pretty uninteresting and will
look something like Figure 2 depending on the contents of your
directory.
Figure 2
-rw-r--r-- 1 mjb group 109 Mar 09 18:32 store.dat
-rw-r--r-- 1 mjb group 93 Mar 09 18:31 store.sav
-rwxr-xr-x 1 mjb group 3058 Mar 09 18:29 store.txt
-rw-r--r-- 1 mjb group 89 Mar 09 18:32 sort.dat
-rw-r--r-- 1 mjb group 193 Mar 09 18:31 sort.sav
-rwxr-xr-x 1 mjb group 2068 Mar 09 18:29 sort.txt
-rw-r--r-- 1 mjb group 20 Mar 09 18:31 palet.txt
So far nothing very exciting has happened. In fact, this is
exactly the same output as the simple ls -l command. Obviously
there must be more to awk.
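To see the record-by-record action more directly, here is a small sketch that prints a fixed message instead of the record. The function name announce_records and the three-line input are my own inventions for the demonstration.

```shell
# The action runs once for every record that awk reads.
announce_records() {
  awk '{print "read one record"}'
}

# Three input lines, so the message should appear three times.
printf 'a\nb\nc\n' | announce_records
```

The input text itself never appears in the output; only the number of records matters here.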
Onward! Breaking down into fields
Awk automatically breaks a record into fields. The default
delimiter that awk assumes between fields is white space (one or
more spaces or tabs). In Figure 2, field 1 is "-rw-r--r--" for
the first record, field 2 is "1," field 3 is "mjb," and so on.
When awk reads in a record and breaks the contents of the
record into fields, it assigns a variable name to each field.
These variable names are a dollar sign ($) followed by the
number of the field, counting from left to right. The variable $1
represents the contents of field 1, which for the first record in
Figure 2 is "-rw-r--r--." $2 represents field 2, which is
"1" in Figure 2, and so on. The awk variables $1, $2, and so on
represent the fields of each record and should not
be confused with shell variables that use the same style of
names. Inside an awk script $1 refers to field 1 of a record; $2
to field 2 of a record.
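The sketch below labels the first three fields of a single made-up record so you can see which text lands in which variable. The function name show_fields and the sample record are assumptions for illustration; the record deliberately uses three spaces and then a tab between fields to show that a run of white space counts as one delimiter.

```shell
# Print the first three fields, each with a label in front of it.
show_fields() {
  awk '{print "1=" $1 " 2=" $2 " 3=" $3}'
}

# Three spaces, then a tab, separate the fields in this made-up record.
printf '%s   %s\t%s\n' '-rw-r--r--' 1 mjb | show_fields
```

This should print 1=-rw-r--r-- 2=1 3=mjb.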
In the first awk example, the print command on its own caused
the entire record to be printed. The print command followed by
specific field variables will print only those fields named by
the variables, instead of the entire record. Let's look at an
example. To extract the owner, size, and file name from the
output of an ls -l files listing, you would need to print only
fields 3, 5, and 9. The command for doing this is illustrated in
Figure 3. Note that $3, $5, and $9 appear inside the awk script
'{print $3 $5 $9}' and are therefore interpreted by awk as awk
field variables. The single quotes protect the awk field
variables from the shell, so there is no attempt to expand them.
It is good practice to get in the habit of including opening and
closing single quotes around awk commands to protect them from
shell expansion.
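Here is a hedged sketch of what goes wrong without the single quotes. The two function names are hypothetical, and the example assumes a Bourne-style shell in which an unset $1 expands to nothing (the default behavior).

```shell
# Inside a shell function, the shell's own $1 is the function's first argument.
single_quoted() { awk '{print $1}'; }   # awk receives $1 and prints field 1
double_quoted() { awk "{print $1}"; }   # the shell replaces $1 before awk runs

printf 'alpha beta\n' | single_quoted   # prints: alpha
# Called with no argument, the shell's $1 is empty, so awk receives {print }
# and prints the whole record instead of field 1.
printf 'alpha beta\n' | double_quoted   # prints: alpha beta
```

With double quotes the shell gets to the dollar sign first, so awk never sees the field variable.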
Figure 3
ls -l|awk '{print $3 $5 $9}'
The problem with the output of this command is shown in
Figure 4. There are no spaces between fields.
Figure 4
mjb109store.dat
mjb93store.sav
mjb3058store.txt
mjb89sort.dat
mjb193sort.sav
mjb2068sort.txt
mjb20palet.txt
One way around this is to embed literals in the print line as
in Figure 5, which puts spaces in the output lines, producing
the output shown in Figure 6.
Figure 5
ls -l|awk '{print $3 " " $5 " " $9}'
Figure 6
mjb 109 store.dat
mjb 93 store.sav
mjb 3058 store.txt
mjb 89 sort.dat
mjb 193 sort.sav
mjb 2068 sort.txt
mjb 20 palet.txt
This provides some spacing, but the fields don't line up very
well. One simple way to improve alignment is to embed tabs in
the literals instead of spaces. Repeat the command line in
Figure 5, but instead of pressing the space bar between the
double quotes, press the TAB key. You will not see any
characters on the screen, but the double quotes will be
separated by what appear to be more spaces. These
"more" spaces are actually a tab character. The result
will look something like Figure 7. Figure 8 is an example of the
output.
Figure 7
ls -l|awk '{print $3 " " $5 " "$9}'
Figure 8
mjb 109 store.dat
mjb 93 store.sav
mjb 3058 store.txt
mjb 89 sort.dat
mjb 193 sort.sav
mjb 2068 sort.txt
mjb 20 palet.txt
In one more variation, we can switch the order of the fields
during printing as in the listing in Figure 9 and the output in
Figure 10. In this and subsequent examples I will use the ^
(caret) character to show where the TAB key is pressed.
Figure 9
ls -l|awk '{print $9 " ^" $5 " ^"$3}' (<-- note ^ = TAB key)
Figure 10
store.dat 109 mjb
store.sav 93 mjb
store.txt 3058 mjb
sort.dat 89 mjb
sort.sav 193 mjb
sort.txt 2068 mjb
palet.txt 20 mjb
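If pressing the TAB key is awkward in your shell, awk string literals also accept the escape sequence \t for a tab. The sketch below is an alternative to the literal-tab approach used in the figures, not the method the figures themselves use; the function name tab_columns is made up, and the input is one record copied from Figure 2.

```shell
# "\t" inside an awk string literal stands for a tab character.
tab_columns() {
  awk '{print $9 "\t" $5 "\t" $3}'
}

printf '%s\n' '-rw-r--r-- 1 mjb group 109 Mar 09 18:32 store.dat' | tab_columns
```

The output is the file name, the size, and the owner, separated by tabs.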
Executing more than one set of
commands
Figure 11 adds two more features of awk. You may execute more
than one set of commands on a record by separating the commands
with a semicolon (;), and awk allows flexible use of
user-defined variables within scripts. In this example a
variable keeps a running total of the file sizes, in bytes, of
the lines processed so far. As each record is processed, field $5
is added to the variable ttl before the printing takes place;
then, as the fields are printed, ttl is printed on each line as a
running total of the file sizes.
The variable ttl is initialized to zero the first time it is
used. Since the ttl variable is accessed once each time a record
is read, it is accessed for the first time when the first record
is read. When this first read happens, and the first reference
to variable ttl is made, ttl is automatically set to zero. The
syntax "ttl += $5" is borrowed from C. In other
programming languages it would be necessary to write something like
this:
add $5 to ttl
or
ttl = ttl + $5
Awk uses += as a shorthand for "add to."
Awk initializes all variables to 0 when they are used for
numbers and to "" when they are used for string
storage. Awk is flexible about its variables, and you do not
have to identify them as numeric or string types before using
them. The ttl variable could have been used as a string holder,
but since it is used for numeric information it starts life as a
zero when the first record is read, and thereafter immediately
has the contents of field $5 added to it.
As a note on Figure 11, press the TAB key after the double
quote but before "Total."
Figure 11
ls -l|awk '{ttl+=$5; print $9 " ^" $5 " ^"$3 " ^Total " ttl " bytes"}'
Figure 12 is the output of Figure 11.
Figure 12
store.dat 109 mjb Total 109 bytes
store.sav 93 mjb Total 202 bytes
store.txt 3058 mjb Total 3260 bytes
sort.dat 89 mjb Total 3349 bytes
sort.sav 193 mjb Total 3542 bytes
sort.txt 2068 mjb Total 5610 bytes
palet.txt 20 mjb Total 5630 bytes
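The automatic initialization can be checked in isolation with a small sketch. The variable names count and label, and the function name init_demo, are made up for this demonstration; neither variable is set before its first use.

```shell
# count is first used as a number, so it starts at 0.
# label is first used as a string, so it starts as "".
init_demo() {
  awk '{count += 1; label = label "x"; print count " [" label "]"}'
}

printf 'a\nb\nc\n' | init_demo
```

This should print 1 [x], then 2 [xx], then 3 [xxx], one per line.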
Line splitting
Awk examples are gradually getting too long for a single line,
so we will have to start splitting the lines. If you are not
using the C shell, one way to do this is to press enter after
you have typed the initial opening single quote before the awk
commands. The line will be continued allowing you to enter one
or more commands until the final closing single quote is typed.
This can be used to break an awk script or program into several
separate lines. Figure 13 is an example. The output, shown in
Figure 14, is identical to Figure 12.
Figure 13
ls -l|awk ' <- once the open quote is typed, press enter
{ttl+=$5; <- and continue on the next lines
print $9 " ^" $5 " ^"$3 " ^Total " ttl " bytes"}
' <- until the final closing quote
Figure 14
store.dat 109 mjb Total 109 bytes
store.sav 93 mjb Total 202 bytes
store.txt 3058 mjb Total 3260 bytes
sort.dat 89 mjb Total 3349 bytes
sort.sav 193 mjb Total 3542 bytes
sort.txt 2068 mjb Total 5610 bytes
palet.txt 20 mjb Total 5630 bytes
For the C shell, use the backslash as the line continuation
character as shown in Figure 15. Further examples will assume
that you are using sh, ksh, or one of their derivatives. If you
are using csh, then be sure to include the backslash
continuation characters.
Figure 15
ls -l|awk ' \ <- use the backslash to force a continuation
{ttl+=$5; \ <- on each line
print $9 " ^" $5 " ^"$3 " ^Total " ttl " bytes"} \
' <- until the final closure, then press enter
A running total is fine, but what I really wanted here was a
total bytes count at the end of the listing.
Although the awk default is to perform all commands on each
record, awk also allows actions to be performed before the first
record is read, and/or after the last record is processed.
Commands to be executed at the beginning or end of the records
are set off by the key words BEGIN and END. Figure 16 is an
example of the END key word. The values in field $5 are still
accumulated in the ttl variable, but the total in ttl is printed
as part of the END action instead of with each record.
Figure 16
ls -l|awk '
{ttl+=$5;
print $9 " ^" $5 " ^"$3}
END{print "Total " ttl " bytes"}'
Figure 17 is the output of Figure 16 and you will see that
the total is printed as a final line after the last directory
entry.
Figure 17
store.dat 109 mjb
store.sav 93 mjb
store.txt 3058 mjb
sort.dat 89 mjb
sort.sav 193 mjb
sort.txt 2068 mjb
palet.txt 20 mjb
Total 5630 bytes
Figure 18 adds the use of the BEGIN key word and Figure 19
shows the output with the heading created with the BEGIN
statement.
Figure 18
ls -l|awk '
BEGIN{print "Custom Directory Listing"}
{ttl+=$5;
print $9 " ^" $5 " ^"$3}
END{print "Total " ttl " bytes"}'
Figure 19
Custom Directory Listing

store.dat 109 mjb
store.sav 93 mjb
store.txt 3058 mjb
sort.dat 89 mjb
sort.sav 193 mjb
sort.txt 2068 mjb
palet.txt 20 mjb
Total 5630 bytes
Figure 20 is a pseudo-listing of the three parts of the awk
script. The middle section is marked "each record,"
but this is not an awk keyword. It is inserted to make the
pseudo-listing clearer.
Figure 20
ls -l|awk '
BEGIN {print "Custom Directory Listing"}
each record {ttl+=$5;print $9 " ^" $5 " ^"$3}
END {print "Total " ttl " bytes"}'
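A runnable version of the three parts might look like the sketch below, with comments taking the place of the "each record" label. It assumes sh or ksh, uses the \t escape where the figures use a pressed TAB key, and feeds awk two records copied from Figure 2 rather than real ls -l output. The function name custom_listing is my own.

```shell
custom_listing() {
  awk '
  BEGIN { print "Custom Directory Listing" }     # once, before the first record
        { ttl += $5; print $9 "\t" $5 "\t" $3 }  # once for every record
  END   { print "Total " ttl " bytes" }          # once, after the last record
  '
}

printf '%s\n' \
  '-rw-r--r-- 1 mjb group 109 Mar 09 18:32 store.dat' \
  '-rw-r--r-- 1 mjb group 93 Mar 09 18:31 store.sav' | custom_listing
```

With these two records the final line should read Total 202 bytes.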
Take another look at Figure 19 for an additional problem that
can be fixed with a feature of awk. There is a blank line
between "Custom Directory Listing" and the line
containing the first file. Why? I fudged a bit in the earlier
part of this article. The real result of an ls -l actually looks
more like Figure 21. The total blocks are listed on the first
line.
Figure 21
total 18
-rw-r--r-- 1 mjb group 109 Mar 09 18:32 store.dat
-rw-r--r-- 1 mjb group 93 Mar 09 18:31 store.sav
-rwxr-xr-x 1 mjb group 3058 Mar 09 18:29 store.txt
-rw-r--r-- 1 mjb group 89 Mar 09 18:32 sort.dat
-rw-r--r-- 1 mjb group 193 Mar 09 18:31 sort.sav
-rwxr-xr-x 1 mjb group 2068 Mar 09 18:29 sort.txt
-rw-r--r-- 1 mjb group 20 Mar 09 18:31 palet.txt
Awk sees the line containing "total 18" as the
first record that it processes. This first record only has
fields $1 and $2, so fields $3, $5, and $9 are blank for the
first record. The print command on this first record is actually
printing three empty fields from the first record. These show up as
a single blank line, but this single line provides an
opportunity to show another part of the awk language.
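The empty fields can be made visible by wrapping each one in brackets, as in this small sketch. The function name bracket_fields is hypothetical, and the input is the single "total 18" line from Figure 21.

```shell
# Fields that do not exist in a record are simply empty strings.
bracket_fields() {
  awk '{print "[" $3 "] [" $5 "] [" $9 "]"}'
}

printf '%s\n' 'total 18' | bracket_fields
```

This should print [] [] []. An empty $5 also counts as zero when it is added to ttl, which is why the byte total is not affected by this record.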
"If" tests and conditions
An if test can be used to eliminate an unwanted record. Figure
22 includes an if test that uses the next statement, on line 3.
The if test is straightforward, except that awk uses
"==" (equal equal) for "is equal to." In
English this would read, "If the first field is equal to
'total'..."
The next statement causes awk to skip all further actions on
this record and to loop back to the top of the logic that reads
the next input record.
Figure 22
ls -l|awk '
BEGIN{print "Custom Directory Listing"}
{if($1 == "total") next;
ttl+=$5;
print $9 " ^" $5 " ^"$3}
END{print "Total " ttl " bytes"}'
Figure 23 is an illustration of the steps that happen in awk
record processing as the if condition is tested, and what the
next does. Note that step 1 in the illustration, read a record,
is the automatic default action of awk, and no awk command is
needed to read a record.
Figure 23 The logic in an if-next statement
1. read a record < the automatic action in awk
2. { if ($1 == "total") < test the first field
3. next; < if true go to step 1
4. ttl += $5; < otherwise continue
5. (rest of the code)
Even simple if tests such as the one shown here can add a
powerful tool to awk processes.
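To try the if-next logic without depending on the contents of your own directory, here is a sketch that feeds awk a "total" line followed by two records copied from Figure 21. It assumes sh or ksh and uses the \t escape in place of a pressed TAB key; the function name skip_total is made up.

```shell
skip_total() {
  awk '
  { if ($1 == "total") next        # skip the block-count line
    ttl += $5
    print $9 "\t" $5 "\t" $3 }
  END { print "Total " ttl " bytes" }'
}

printf '%s\n' \
  'total 18' \
  '-rw-r--r-- 1 mjb group 109 Mar 09 18:32 store.dat' \
  '-rw-r--r-- 1 mjb group 20 Mar 09 18:31 palet.txt' | skip_total
```

The output should contain no blank line, just the two file lines followed by Total 129 bytes.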
This is about all I have space for in this edition, so join
me next month for some more advanced features in awk, including
better formatting and processing of files whose fields are not
separated by spaces.