Compressing files in Unix
Understanding how disk space is allocated can help you get more from your drives
Summary
This month we'll cover the basics of file compression in Unix. We'll explain why small files
may be larger than large files, why one file may be better than two, and what you can do to
squeeze more space out of your disk, including how to use the tape archive (tar)
utility to better compress small files. (2,100 words)
How big is a file, anyway?
You might expect the answer to be pretty simple. Type ls -l
and the directory listing tells you how many bytes are in each file.
In the example listing below, minutes.txt is 3 bytes long
(must have been a short meeting) and note.txt is 1201
bytes long.
$ ls -l
total 6
-rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
-rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
But those are the file sizes, not the amount of space used
on the disk. To see the space used on the disk, add the -s switch
by typing ls -ls. The new listing (shown below) includes an initial
column that contains the number of blocks the file uses on the
disk. A block is a unit of 512 bytes. The first file, minutes.txt,
uses 2 blocks, or 1024 bytes (suddenly that meeting doesn't seem so
short), and note.txt uses 4 blocks, a whopping 2048 bytes.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
What's happening here? It would be nice if a 3-byte file actually used
3 bytes, plus maybe a few bytes for the file name and other
information in the directory. Unfortunately that's never been a
practical way to organize a disk. The overhead in keeping track of
the directory would become a load on the system. Also, as a file
expanded and contracted, due to editing and data entry, it would
become heavily fragmented. The file's first 3 bytes would be down on track 12,
the next 14 bytes over on track 25, and future additions spread
out to track 64. The directory would become a hodgepodge of pointers, and loading
files into the editor would require the read heads to scramble all over the disk
collecting directory information and tiny bits of file.
To handle this problem, a compromise was reached in disk organization.
A convenient number of bytes was selected as the minimum
amount that could be allocated to a file. This amount could be
called an allocation unit. If a file didn't use all the space in
its allocation unit, the remainder of the unit would be set aside
for future expansion. As a file expanded, so long as
it didn't exceed the number of bytes in its allocation unit, all
new information was stored in the reserved space on the disk.
Once the file exceeded that space, another allocation unit
was grabbed and reserved. Any spill over from the first allocation
unit would be tucked into the new one, and so on.
Now the directory had only to locate the first allocation unit.
This method is used in all major operating systems in one
form or another.
Earlier Unix systems used an allocation unit of 512 bytes. These 512 bytes
made up 1 block. As disk sizes grew, the basic allocation unit was increased
to 1024 bytes on most systems (larger on some), but many utilities, such as ls,
continue to report file sizes and disk use in 512-byte blocks. The 512-byte block
remains the standard reporting unit for many utilities, even though the actual
allocation unit has increased to 2 or more blocks.
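The rounding this implies is easy to sketch in a few lines of shell arithmetic. The values below (512-byte blocks, a 1024-byte allocation unit) are taken from this article's example system; substitute your own.

```shell
# Round a file size up to whole allocation units, then express the
# result in 512-byte blocks (sizes from the article's example system).
size=1201     # bytes in the file (note.txt)
block=512     # bytes per block
alloc=1024    # bytes per allocation unit
units=$(( (size + alloc - 1) / alloc ))   # round up to whole units
blocks=$(( units * alloc / block ))       # convert units to blocks
echo "$units allocation units, $blocks blocks"   # prints: 2 allocation units, 4 blocks
```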
Black holes
With this background, let's now look at the ls -ls listing
again. A 3-byte file like minutes.txt holds only enough data for one
512-byte block but, more importantly for disk usage, it ties up 1
allocation unit, which on the example system is 2 blocks, or 1024 bytes.
The ls -ls listing correctly indicates 2 blocks used on the disk.
Similarly, note.txt is 1,201 bytes, enough data to fill 3 blocks
(2 complete 512-byte blocks, or 1,024 bytes, plus an additional 177
bytes in a third block). Because space is handed out in whole allocation
units, note.txt actually uses 2 allocation units, or 4 blocks, as
indicated in the listing.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
This seems dreadfully wasteful. In fact 99.7 percent of the space allocated
for minutes.txt is unused, and 41.4 percent of the space for note.txt is
wasted. Multiply this by the number of files on the system, and you'll
begin to imagine vast black holes of disk space that cannot be
reached except by forcing all users to create and fill files
that are multiples of 1024.
Before you start hyperventilating, remember that the high percentage
of waste only occurs on very small files, so the larger the file
the more efficient the allocation system is. If you allow that
your system is probably working fairly well, you'll recognize that
the allocation system is a good compromise between disk allocation
and speed of disk access.
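The waste percentages quoted above can be checked with a quick awk one-liner; the 1024-byte allocation unit is, again, the example system's value.

```shell
# Percent of allocated space left unused by a file of a given size,
# assuming a 1024-byte allocation unit as in the article's examples.
waste() {
    awk -v s="$1" -v a="$2" 'BEGIN {
        alloc = int((s + a - 1) / a) * a            # bytes actually reserved
        printf "%.1f\n", (alloc - s) * 100 / alloc  # percent unused
    }'
}
waste 3 1024      # minutes.txt: prints 99.7
waste 1201 1024   # note.txt:    prints 41.4
```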
One useful task is to establish the allocation unit size of
your system. You could plow through manuals for this
information, but a simpler method is to read the manpage for ls
to establish the block size used by the -s option (usually 512 bytes),
then use vi to create a file containing only a few bytes and close
the file. Type ls -ls to see the number of blocks used for
that small file, multiply by the block size, and you have your basic
allocation unit size.
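The steps above can be sketched as a few shell commands. One caveat: on some systems ls -s reports 1024-byte blocks rather than 512-byte blocks, so check your ls manpage and adjust the multiplier.

```shell
# Probe the allocation unit: create a 1-byte file, then see how many
# blocks ls -s charges it for. Assumes ls -s reports 512-byte blocks;
# some systems use 1024-byte blocks instead (check your ls manpage).
printf 'x' > tiny.probe
blocks=$(ls -s tiny.probe | awk '{print $1}')
echo "allocation unit: $(( blocks * 512 )) bytes"
rm tiny.probe
```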
One other useful note: ls -l, ls -s, and similar variations
display a total line as the first record of a directory display. The
total 6 in the listing below is simply the sum of the block counts
displayed by typing ls -ls.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
Compression
Now that you've identified one type of disk-eating file, what can you do about it?
You've probably heard of, or even used, some of the various
file-compression utilities, such as pack, compress, and the GNU
(GNU's Not Unix) utility gzip.
These utilities work very well on large files, but perform poorly on
small ones. In the sample listing below, compress is applied to
each of the files and the results are displayed. The compress
utility correctly recognizes that it can't do any good on
minutes.txt, and leaves it alone. It does, however, compress note.txt to
188 bytes. Note that compress appends .Z to the name of a file when it
compresses it. The effects of compress are reversed by using
uncompress, or compress -d file.ext; you don't need to
include the .Z in the file name.
In this case we've eliminated 2 blocks, as note.txt compressed
down to 2 blocks from 4. If you follow this logic through, you
begin to realize that a small file can never be compressed below 2
blocks (or the default allocation unit for your system).
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
$ compress minutes.txt
$ compress note.txt
$ ls -ls
total 4
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
2 -rw-r--r-- 1 mjb group 188 Feb 04 23:25 note.txt.Z
$
Compressing files with tar
If you have a directory of small files that are little used, but
need to remain on the system, one way to handle them is to combine
them into one file, then remove the originals. If the files can be
strung together, all the little files can be packed into one
larger file. The obvious candidate for this combining action is the tar
(tape archive) utility, so we'll try it.
Study the following listing for a moment. The tar command uses key
letters to signify the actions to be performed. These are a bit like
command-line switches, but are not preceded by -. In this instance, the
tar arguments are:
c         create a new archive
v         verbose; report on what is being done
f         the next argument is the name of the archive
txt.tar   the archive that is being created
*.txt     the list of files to include in the archive
Immediately after you type the tar command, tar informs you that it
has appended minutes.txt, which takes 1 tape block
(the line a minutes.txt 1 tape block), and appended
note.txt, which takes 3 tape blocks (a
note.txt 3 tape blocks). So tar reports its results in 512-byte
blocks rather than 1024-byte double blocks.
However, there is a bit of a shock in the ls -ls command issued
after the tar is complete. The new archive txt.tar is 8 blocks long.
That's longer than the original 6 blocks used by the two files. The
tar utility is a bit mindless: it doesn't actually string files end
to end; rather, it strings blocks end to end. It also has to add directory
information to txt.tar, so it's not unusual (in fact it's common) for
a tar archive to be larger than the sum of its parts.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
$ tar cvf txt.tar *.txt
a minutes.txt 1 tape block
a note.txt 3 tape blocks
$ ls -ls
total 14
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
8 -rw-r--r-- 1 mjb group 4096 Feb 05 01:40 txt.tar
$
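Assuming the classic tar format, the 4096-byte figure in the listing can be reproduced exactly: each archive member gets a 512-byte header block plus its data rounded up to whole 512-byte blocks, and the archive ends with at least two 512-byte zero blocks (real tar implementations may pad further, out to their blocking factor).

```shell
# Size of one tar member: a 512-byte header plus the file's data
# rounded up to whole 512-byte blocks.
member() { echo $(( 512 + (($1 + 511) / 512) * 512 )); }

# minutes.txt (3 bytes) + note.txt (1201 bytes) + two zero trailer blocks
echo $(( $(member 3) + $(member 1201) + 1024 ))   # prints 4096
```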
This makes things look grim. Fortunately, the empty space in those
blocks is filled with what the manual calls garbage, and in practice
this garbage usually takes the form of hex zeros (NULs). That makes
a tar archive an excellent candidate for compression.
Proceeding to the next logical step, the following listing
compresses the tar archive. The resulting file txt.tar.Z is
404 bytes long (1 allocation unit or 2 blocks). Then, by removing the
original text files, the directory contents are reduced to only 2
blocks, saving 66 percent of the space previously used.
$ ls -ls
total 14
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
8 -rw-r--r-- 1 mjb group 4096 Feb 05 01:40 txt.tar
$ compress txt.tar
$ ls -ls
total 8
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
2 -rw-r--r-- 1 mjb group 404 Feb 05 01:40 txt.tar.Z
$ rm *.txt
$ ls -ls
total 2
2 -rw-r--r-- 1 mjb group 404 Feb 05 01:40 txt.tar.Z
$
The following listings show you how to reverse the tar-and-compress
process. The tar key argument for extracting from an archive is x .
The other key arguments are the same as in the earlier tar command.
$ ls -ls
total 2
2 -rw-r--r-- 1 mjb group 404 Feb 05 01:40 txt.tar.Z
$ uncompress txt.tar
$ ls -ls
total 8
8 -rw-r--r-- 1 mjb group 4096 Feb 05 01:40 txt.tar
$ tar xvf txt.tar
x minutes.txt, 3 bytes, 1 tape block
x note.txt, 1201 bytes, 3 tape blocks
$ ls -ls
total 14
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
8 -rw-r--r-- 1 mjb group 4096 Feb 05 01:40 txt.tar
$ rm txt.tar
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
So you can use tar and compress to save a lot of disk
space on files that are rarely used. The GNU gzip utility has
a few extra options that make it more efficient than compress, especially
on large files.
Once you've located stashes of files that are rarely used, but which
must remain available, you should archive and compress them. For
directories with many little files, tar them and then compress or gzip
the archive. For directories with a few large files, compress or gzip
the files directly. You may tar them first if you like, but it will
probably make little difference in space used, though it might make
things easier to administer.
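As a sketch of the gzip variant of the workflow (oldnotes is a hypothetical directory of little-used files; the first line just creates stand-in files for the demo):

```shell
# Pack a directory of small, rarely used files into one archive,
# compress it, and remove the originals. 'oldnotes' is a hypothetical
# directory name; the first line creates sample data for the demo.
mkdir -p oldnotes && echo "sample" > oldnotes/sample.txt

tar cvf oldnotes.tar oldnotes    # combine the little files
gzip oldnotes.tar                # produces oldnotes.tar.gz
rm -r oldnotes                   # remove the originals

# ...later, to restore:
gunzip oldnotes.tar.gz
tar xvf oldnotes.tar
rm oldnotes.tar
```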
Good luck squeezing more space out of those disk drives.