Too small to keep, too big to throw back
Small fry Unix commands, part 3: du, tr, nl
Summary
This is the third installment in a series on small-but-useful
Unix commands. Mo covers the du (summarize disk
usage), tr (translate characters), and nl
(line numbering filter) commands. (1,900 words)
In my March '98 article on file
compression, I asked the question: How big is a file, anyway?
This month I am going to expand that question to: How big is a
directory, anyway?
If the directory only contains files, it's easy enough to
issue an ls -ls command and get the sizes of files
in bytes and blocks.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
The first column contains the size of the file in 512-byte
blocks, and the sixth column gives the size of the file in
bytes. Files in this directory consume 6 blocks, containing only
1204 bytes. In the March column, I discussed allocation units --
the minimum space allocated by the operating system for a file.
You should review that article for more details, but here's a
brief explanation of how allocation units work.
Allocation units are used in all major operating systems in one
form or another. Some convenient number of bytes is selected as
the minimum amount that can be allocated to a file. This amount
is an allocation unit. If the file doesn't use all the
space in an allocation unit, its data is recorded at the beginning of
the unit, and the remaining space is set aside to accommodate
further expansion of that file.
As you add to the file, the new data is stored in the empty
reserved space on the disk, so long as it doesn't exceed the
number of bytes permitted in an allocation unit. Once the file
has used all available space, another allocation unit is grabbed
and reserved. Any spillover from the first allocation unit is
tucked in at the start of the second allocation unit, and so on.
Earlier Unix systems used an allocation unit of 512 bytes.
These 512 bytes came to be known as a block. As disk sizes grew,
the basic allocation unit was increased to 1024 bytes on most
systems (larger on some), but many utilities, such as ls
above, still report file sizes or disk use in 512-byte blocks.
So the 3-byte file, which occupies one 1024-byte allocation unit,
is reported as using 2 blocks.
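The arithmetic behind those block counts can be sketched in a few lines of shell. This is only an illustration: the function name blocks_for is made up for this example, and the 1024-byte allocation unit is an assumption that happens to match the ls output above, not a value that holds on every system.

```shell
# blocks_for SIZE: the number of 512-byte blocks ls -s would report for a
# file of SIZE bytes, ASSUMING a 1024-byte allocation unit (systems vary).
blocks_for() {
    size=$1
    unit=1024
    units=$(( (size + unit - 1) / unit ))   # round up to whole allocation units
    echo $(( units * unit / 512 ))          # express that space in 512-byte blocks
}

blocks_for 3       # prints 2 (minutes.txt)
blocks_for 1201    # prints 4 (note.txt)
```

Under that assumption the two results add up to the 6 blocks that ls reported for 1204 bytes of data.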
In the following example, the directory in question includes
a subdirectory, perl. The 2 blocks allocated for the
perl directory are the blocks used only by the directory itself,
not those used by the files in the directory.
$ ls -ls
total 8
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
2 drwxr-xr-x 2 mjb group 128 Jan 29 18:53 perl
We could figure out the sizes by running ls -ls perl,
but suppose there's another directory under perl? And what if
there were a third directory beneath that one?
How do you du?
The solution to this dilemma is the Unix utility du.
This little utility will recurse through all subdirectories and
display the blocks used by each one. In the display below, the
directory being processed contains a perl subdirectory, which in
turn contains a src subdirectory. The src directory contains
files totaling 1540 blocks. The perl directory count includes
all the blocks in src plus the blocks used by files in perl.
Finally, the top level includes all blocks below it, plus blocks
used by files in the current directory.
$ du
1540 ./perl/src
5648 ./perl
5654 .
The -a option displays the details for each
file.
$ du -a
1500 ./perl/src/big.prl
40 ./perl/src/prog.prl
1540 ./perl/src
4108 ./perl/perl.tar
5648 ./perl
2 ./minutes.txt
4 ./note.txt
5654 .
The du command will cut through a lot of ls
commands. It provides size information as well as a reasonable
display of the directory tree.
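If you want to try this on a tree you can safely throw away, something like the following sketch should work. The file and directory names are illustrative only, the block counts will differ from system to system, and I am assuming a mktemp that supports -d, which is common but not guaranteed everywhere. The -s option, not covered above, prints a single total instead of one line per directory.

```shell
# Build a small throwaway tree; every name here is illustrative only.
tmp=$(mktemp -d)
mkdir -p "$tmp/perl/src"
echo 'print 1;' > "$tmp/perl/src/prog.prl"
echo 'remember the minutes' > "$tmp/note.txt"

# One line per directory, subdirectories before their parents
dirs_only=$(cd "$tmp" && du)
# -a adds a line for every file as well
everything=$(cd "$tmp" && du -a)
# -s prints only the grand total for the named directory
total_only=$(cd "$tmp" && du -s .)

printf '%s\n' "$dirs_only" "$everything" "$total_only"
```

The first listing has three lines (one per directory), the second has five (two files plus three directories), and the third has one.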
Switching things around with tr
The tr utility translates one set of characters
into another. It reads standard input, so the command tr abc def < test.txt will
process the lines from test.txt and will translate
the letter a to d, the letter b to e
and the letter c to f. At first glance this
doesn't seem very useful, unless you want to practice amateur
cryptography, but tr has additional options that
make it much more powerful. Two examples should give you a feel
for the command.
The characters to be translated can be expressed as a range.
In the command below, a directory listing is piped through tr,
which translates a to A, b to B and
so on -- converting everything from lowercase to uppercase.
The ranges are quoted so that the shell passes them to tr unchanged.
$ ls -ls | tr '[a-z]' '[A-Z]'
TOTAL 8
2 -RW-R--R-- 1 MJB GROUP 3 FEB 04 23:31 MINUTES.TXT
4 -RW-R--R-- 1 MJB GROUP 1201 FEB 04 23:25 NOTE.TXT
2 DRWXR-XR-X 2 MJB GROUP 128 JAN 29 18:53 PERL
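Both forms of translation are easy to check on a line of text fed in with echo. The sample strings below are my own. The quotes around the bracketed ranges matter: left unquoted, the shell may treat [a-z] as a filename pattern and replace it with a matching one-letter file name before tr ever sees it.

```shell
# Character-for-character translation: a->d, b->e, c->f
echo "a cab" | tr abc def                  # prints: d fde

# Ranges, quoted so the shell cannot expand [a-z] as a filename pattern
echo "minutes.txt" | tr '[a-z]' '[A-Z]'    # prints: MINUTES.TXT
```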
Using tr in the real world
Among other things, case conversion solves a problem created by
some utilities that copy MS-DOS files onto a system. They copy
the files using the uppercase convention of MS-DOS, and the file
names need to be converted to lowercase to work correctly.
Assuming a directory full of files named in uppercase, the
following command will rename all the files to lowercase
versions. The command takes each file name and echoes it through
a pipe using tr to change uppercase to lowercase.
The result is used as the target of a mv command.
$ for name in *
> do
> mv "$name" "$(echo "$name" | tr '[A-Z]' '[a-z]')"
> done
$
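A slightly more defensive version of the same loop is sketched below. It is not the only way to do this. It assumes a case-sensitive filesystem and a mktemp that supports -d, the sample file names are invented, and the two extra checks (skip names that are already lowercase, never overwrite an existing file) are my additions.

```shell
# Set up a scratch directory with made-up MS-DOS style names.
tmp=$(mktemp -d)
touch "$tmp/README.TXT" "$tmp/AUTOEXEC.BAT" "$tmp/notes.txt"

(
    cd "$tmp" || exit 1
    for name in *
    do
        lower=$(printf '%s\n' "$name" | tr '[A-Z]' '[a-z]')
        # Skip names that are already lowercase, and never overwrite a file.
        if [ "$name" != "$lower" ] && [ ! -e "$lower" ]
        then
            mv "$name" "$lower"
        fi
    done
)

ls "$tmp"    # lists autoexec.bat, notes.txt and readme.txt
```

Quoting "$name" and "$lower" keeps the loop working for file names that contain spaces.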
tr includes the -s switch, which
squeezes repeating instances of the output characters to one
instance. In the following example, the file test.txt
contains one line with several spaces between the words. The tr
command translates each space into another space, but the -s
option compacts multiple spaces into a single output space. The
resulting file, test2.txt, has a single space between
each word.
$ cat test.txt
How   are    you     today?
$ tr -s " " " " < test.txt >test2.txt
$ cat test2.txt
How are you today?
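The same squeeze can be tried without creating any files. In this sketch the input line is supplied by printf, and the second command relies on the fact that tr -s accepts a single set, in which case it squeezes runs of those characters without translating anything.

```shell
# Translate space to space and squeeze, as in the example above
printf 'How   are    you     today?\n' | tr -s " " " "    # prints: How are you today?

# With only one set, -s just squeezes repeats of the listed characters
printf 'How   are    you     today?\n' | tr -s ' '        # prints: How are you today?
```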
Fancy line numbering with nl
The nl utility adds line numbers to a file.
Although this would seem like a simple task, nl has
a great number of options. To illustrate some of these options,
we're going to undertake the old-fashioned task of adding line
numbers to a Cobol program. I chose this example because it's a
great way of illustrating many of the features of nl.
The following listing is hello.txt, a Cobol program with missing
line numbers.
$ cat hello.txt
IDENTIFICATION DIVISION.
PROGRAM-ID. HELLO.
ENVIRONMENT DIVISION.
DATA DIVISION.
PROCEDURE DIVISION.

PROGRAM-BEGIN.
DISPLAY "Hello world."

PROGRAM-DONE.
STOP RUN.
The first pass at this is simply to add line numbers, as in
the following listing. The output has several problems.
The numbers in this listing start at one and rise in
increments of one. Cobol usually operates in increments of 10 or
100, although one is valid. Cobol numbering also includes
leading zeroes, which this listing doesn't display. Blank lines
should be numbered but aren't. Finally, nl's
default behavior is to add a tab separator after the number and
before the original line. Though the tabs are not visible in
this listing, many Cobol compilers can't handle them at all.
$ nl < hello.txt > hello.cbl
$ cat hello.cbl
1 IDENTIFICATION DIVISION.
2 PROGRAM-ID. HELLO.
3 ENVIRONMENT DIVISION.
4 DATA DIVISION.
5 PROCEDURE DIVISION.

6 PROGRAM-BEGIN.
7 DISPLAY "Hello world."

8 PROGRAM-DONE.
9 STOP RUN.
Let's tackle these problems one at a time. The separator
character can be specified as an ordinary space using the -s
switch (as in -s" "). The first modified
version of the command is shown below.
$ nl -s" " < hello.txt > hello.cbl
The format for the number itself is controlled by several
options. The -w option specifies the width of the
number. For Cobol, this width is six. The default for nl
happens to be six, but I'll include the option to be thorough.
The -v option lets you specify the starting number,
and -i lets you specify the increment. In the
listing below, I've specified a space separator, and a width of
6 digits, starting at 100 and going up in increments of 100.
$ nl -s" " -w6 -v100 -i100 < hello.txt > hello.cbl
$ cat hello.cbl
100 IDENTIFICATION DIVISION.
200 PROGRAM-ID. HELLO.
300 ENVIRONMENT DIVISION.
400 DATA DIVISION.
500 PROCEDURE DIVISION.

600 PROGRAM-BEGIN.
700 DISPLAY "Hello world."

800 PROGRAM-DONE.
900 STOP RUN.
This is closer, but it still needs work. The number format is
controlled by the -n option. There are three
formats. Left-justified with leading zeroes suppressed is
represented as -nln. Right-justified with leading
zeroes suppressed is -nrn. (This is the default.)
Right-justified with leading zeroes kept is -nrz. I
use -nrz in the following listing:
$ nl -s" " -w6 -v100 -i100 -nrz < hello.txt > hello.cbl
$ cat hello.cbl
000100 IDENTIFICATION DIVISION.
000200 PROGRAM-ID. HELLO.
000300 ENVIRONMENT DIVISION.
000400 DATA DIVISION.
000500 PROCEDURE DIVISION.

000600 PROGRAM-BEGIN.
000700 DISPLAY "Hello world."

000800 PROGRAM-DONE.
000900 STOP RUN.
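To see the three number formats side by side, it helps to use a visible separator and a one-line input. The sketch below uses a colon as the separator purely so the padding shows up. The sample input is my own.

```shell
# The same line numbered with each -n format (width 6, ":" as separator)
echo "x" | nl -w6 -s: -nln    # prints: 1     :x
echo "x" | nl -w6 -s: -nrn    # prints:      1:x
echo "x" | nl -w6 -s: -nrz    # prints: 000001:x

# Starting value and increment combined with zero-filled numbers
printf 'a\nb\nc\n' | nl -s' ' -w6 -v100 -i100 -nrz
# prints:
# 000100 a
# 000200 b
# 000300 c
```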
The default behavior of nl is to leave blank
lines unnumbered, as the earlier listings show. The treatment of blank lines can be
modified with the -b switch. Some -b
options are -ba (number all lines), -bt
(number only non-empty lines -- the default behavior), and -bpstring
(number only lines that match the basic regular expression string).
This last option is interesting. An artificial example of this
is shown in the following listing. Here, only lines containing
the word PROGRAM are numbered.
$ nl -s" " -w6 -v100 -i100 -nrz -bpPROGRAM < hello.txt > hello.cbl
$ cat hello.cbl
IDENTIFICATION DIVISION.
000100 PROGRAM-ID. HELLO.
ENVIRONMENT DIVISION.
DATA DIVISION.
PROCEDURE DIVISION.

000200 PROGRAM-BEGIN.
DISPLAY "Hello world."

000300 PROGRAM-DONE.
STOP RUN.
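The effect of each -b setting is easiest to compare by counting how many lines actually receive a number. The following sketch does that on a three-line sample with a blank line in the middle. The helper names sample and count_numbered are made up for this illustration, and the counting trick relies on -nrz so that every numbered line begins with a digit.

```shell
# A three-line sample with a blank line in the middle
sample() { printf 'PROGRAM-BEGIN.\n\nSTOP RUN.\n'; }

# Count output lines that begin with a digit, i.e. lines that got a number
count_numbered() { grep -c '^[0-9]'; }

all=$(sample | nl -s' ' -nrz -ba | count_numbered)              # every line
text=$(sample | nl -s' ' -nrz -bt | count_numbered)             # non-empty lines
matched=$(sample | nl -s' ' -nrz -bp'^STOP' | count_numbered)   # lines matching ^STOP

echo "$all $text $matched"    # prints: 3 2 1
```

Because the argument to -bp is a regular expression, an anchor such as ^ can restrict the match to the start of the line.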
But what we really want is the -ba option to
number all lines. In the following listing we have the final
version of the command, and the result.
$ nl -s" " -w6 -v100 -i100 -nrz -ba < hello.txt > hello.cbl
$ cat hello.cbl
000100 IDENTIFICATION DIVISION.
000200 PROGRAM-ID. HELLO.
000300 ENVIRONMENT DIVISION.
000400 DATA DIVISION.
000500 PROCEDURE DIVISION.
000600
000700 PROGRAM-BEGIN.
000800 DISPLAY "Hello world."
000900
001000 PROGRAM-DONE.
001100 STOP RUN.
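For anyone who wants to reproduce the final result from scratch, here is one way to do it, offered as a sketch rather than a recipe. It writes the unnumbered program to a temporary directory (I am assuming a mktemp that supports -d), runs the final command, and spot-checks the output.

```shell
# Recreate the unnumbered program in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/hello.txt" <<'EOF'
IDENTIFICATION DIVISION.
PROGRAM-ID. HELLO.
ENVIRONMENT DIVISION.
DATA DIVISION.
PROCEDURE DIVISION.

PROGRAM-BEGIN.
DISPLAY "Hello world."

PROGRAM-DONE.
STOP RUN.
EOF

# Space separator, width 6, start at 100, step 100, zero-filled, all lines
nl -s" " -w6 -v100 -i100 -nrz -ba < "$tmp/hello.txt" > "$tmp/hello.cbl"

head -n 1 "$tmp/hello.cbl"    # prints: 000100 IDENTIFICATION DIVISION.
tail -n 1 "$tmp/hello.cbl"    # prints: 001100 STOP RUN.
```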
The nl program sounds deceptively simple at
first, but it performs a wide range of numbering tasks. It also
includes switches for recognizing the start of new pages, for
numbering pages, and for restarting the numbering
so that the lines on each page can start at one.