Compressing files in Unix
Understanding how disk space is allocated can help you get more from your drives
Summary
This month we'll cover the basics of file compression in Unix. We'll explain why small files
may be larger than large files, why one file may be better than two, and what you can do to
squeeze more space out of your disk, including how to use the tape archive (tar)
utility to better compress small files. (2,100 words)
How big is a file, anyway?
You might expect the answer to be pretty simple. Type ls -l
and the directory listing tells you how many bytes are in each file.
In the example listing below, minutes.txt is 3 bytes long
(must have been a short meeting) and note.txt is 1201
bytes long.
$ ls -l
total 6
-rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
-rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
But those are the file sizes, not the amount of space used
on the disk. To see the space used on the disk, add the -s switch
by typing ls -ls. The new listing (shown below) includes an initial
column that contains the number of blocks the file uses on the
disk. A block is a unit of 512 bytes. The first file, minutes.txt,
uses 2 blocks, or 1024 bytes (suddenly that meeting doesn't seem so
short), and note.txt uses 4 blocks, a whopping 2048 bytes.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
What's happening here? It would be nice if a 3-byte file actually used
3 bytes, plus maybe a few bytes for the file name and other
information in the directory. Unfortunately that's never been a
practical way to organize a disk. The overhead in keeping track of
the directory would become a load on the system. Also, as a file
expanded and contracted, due to editing and data entry, it would
become heavily fragmented. The file's first 3 bytes would be down on track 12,
the next 14 bytes over on track 25, and future additions spread
out to track 64. The directory would become a hodgepodge of pointers, and loading
files into the editor would require the read heads to scramble all over the disk
collecting directory information and tiny bits of file.
To handle this problem, a compromise was reached in disk organization.
A convenient number of bytes was selected as the minimum
amount that could be allocated to a file. This amount could be
called an allocation unit. If a file didn't use all the space in
its allocation unit, the remainder of the unit would be set aside
for future expansion. As a file expanded, so long as
it didn't exceed the number of bytes in its allocation unit, all
new information was stored in the reserved space on the disk.
Once the file exceeded that space, another allocation unit
was grabbed and reserved. Any spill over from the first allocation
unit would be tucked into the new one, and so on.
Now the directory had only to locate the first allocation unit.
This method is used in all major operating systems in one
form or another.
Earlier Unix systems used an allocation unit of 512 bytes. These 512 bytes
made up 1 block. As disk sizes grew, the basic allocation unit was increased
to 1024 bytes on most systems (larger on some), but many utilities, such as ls,
continue to report file sizes and disk use in 512-byte blocks. The 512-byte block
remains the standard reporting unit for many utilities, even though the actual
allocation unit has increased to 2 or more blocks.
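The rounding this implies is easy to sketch in a few lines of shell arithmetic. The values below (512-byte blocks, a 1024-byte allocation unit) are taken from this article's example system; substitute your own.

```shell
# Round a file size up to whole allocation units, then express the
# result in 512-byte blocks (sizes from the article's example system).
size=1201     # bytes in the file (note.txt)
block=512     # bytes per block
alloc=1024    # bytes per allocation unit
units=$(( (size + alloc - 1) / alloc ))   # round up to whole units
blocks=$(( units * alloc / block ))       # convert units to blocks
echo "$units allocation units, $blocks blocks"   # prints: 2 allocation units, 4 blocks
```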
Black holes
With this background, let's now look at the ls -ls listing
again. A 3-byte file like minutes.txt holds only enough data for one
512-byte block but, more importantly for disk usage, it ties up 1
allocation unit, which on the example system is 2 blocks, or 1024 bytes.
The ls -ls listing correctly indicates 2 blocks used on the disk.
Similarly, note.txt is 1,201 bytes, enough data to fill 3 blocks
(2 complete 512-byte blocks, or 1,024 bytes, plus an additional 177
bytes in a third block). Because space is handed out in whole allocation
units, note.txt actually uses 2 allocation units, or 4 blocks, as
indicated in the listing.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
This seems dreadfully wasteful. In fact 99.7 percent of the space allocated
for minutes.txt is unused, and 41.4 percent of the space for note.txt is
wasted. Multiply this by the number of files on the system, and you'll
begin to imagine vast black holes of disk space that cannot be
reached except by forcing all users to create and fill files
that are multiples of 1024.
Before you start hyperventilating, remember that the high percentage
of waste only occurs on very small files, so the larger the file
the more efficient the allocation system is. If you allow that
your system is probably working fairly well, you'll recognize that
the allocation system is a good compromise between disk allocation
and speed of disk access.
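The waste percentages quoted above can be checked with a quick awk one-liner; the 1024-byte allocation unit is, again, the example system's value.

```shell
# Percent of allocated space left unused by a file of a given size,
# assuming a 1024-byte allocation unit as in the article's examples.
waste() {
    awk -v s="$1" -v a="$2" 'BEGIN {
        alloc = int((s + a - 1) / a) * a            # bytes actually reserved
        printf "%.1f\n", (alloc - s) * 100 / alloc  # percent unused
    }'
}
waste 3 1024      # minutes.txt: prints 99.7
waste 1201 1024   # note.txt:    prints 41.4
```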
One useful task is to establish the allocation unit size of
your system. You could plow through manuals for this
information, but a simpler method is to read the manpage for ls
to establish the block size used by the -s option (usually 512 bytes),
then use vi to create a file containing only a few bytes and close
the file. Type ls -ls to see the number of blocks used for
that small file, multiply by the block size, and you have your basic
allocation unit size.
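The steps above can be sketched as a few shell commands. One caveat: on some systems ls -s reports 1024-byte blocks rather than 512-byte blocks, so check your ls manpage and adjust the multiplier.

```shell
# Probe the allocation unit: create a 1-byte file, then see how many
# blocks ls -s charges it for. Assumes ls -s reports 512-byte blocks;
# some systems use 1024-byte blocks instead (check your ls manpage).
printf 'x' > tiny.probe
blocks=$(ls -s tiny.probe | awk '{print $1}')
echo "allocation unit: $(( blocks * 512 )) bytes"
rm tiny.probe
```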
One other useful note: ls -l, ls -s, and similar variations
display a total line as the first record of a directory display. The
total 6 in the listing below is simply the sum of the block counts
displayed by typing ls -ls.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
Compression
Now that you've identified one type of disk-eating file, what can you do about it?
You've probably heard of, or even used, some of the various
file-compression utilities, such as pack, compress, and the GNU
(GNU's Not Unix) utility gzip.
These utilities work very well on large files, but perform poorly on
small ones. In the sample listing below, compress is applied to
each of the files and the results are displayed. The compress
utility correctly recognizes that it can't do any good on
minutes.txt, and leaves it alone. It does, however, compress note.txt to
188 bytes. Note that compress appends .Z to the name of a file when it
compresses it. The effects of compress are reversed by using
uncompress, or compress -d file.ext; you don't need to
include the .Z in the file name.
In this case we've eliminated 2 blocks, as note.txt compressed
down to 2 blocks from 4. If you follow this logic through, you
begin to realize that a small file can never be compressed below 2
blocks (or the default allocation unit for your system).
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
$ compress minutes.txt
$ compress note.txt
$ ls -ls
total 4
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
2 -rw-r--r-- 1 mjb group 188 Feb 04 23:25 note.txt.Z
$
Compressing files with tar
If you have a directory of small files that are little used, but
need to remain on the system, one way to handle them is to combine
them into one file, then remove the originals. If the files can be
strung together, all the little files can be packed into one
larger file. The obvious candidate for this combining action is the tar
(tape archive) utility, so we'll try it.
Study the following listing for a moment. The tar command uses key
letters to signify the actions to be performed. These are a bit like
command-line switches, but are not preceded by -. In this instance, the
tar arguments are:
c         create a new archive
v         verbose; report on what is being done
f         the next argument is the name of the archive
txt.tar   the archive that is being created
*.txt     the list of files to include in the archive
Immediately after you type the tar command, tar informs you that it
has appended minutes.txt, which takes 1 tape block
(the line a minutes.txt 1 tape block), and appended
note.txt, which takes 3 tape blocks (a
note.txt 3 tape blocks). So tar reports its results in 512-byte
blocks rather than 1024-byte double blocks.
However, there is a bit of a shock in the ls -ls command issued
after the tar is complete. The new archive txt.tar is 8 blocks long.
That's longer than the original 6 blocks used by the two files. The
tar utility is a bit mindless: it doesn't actually string files end
to end; rather, it strings blocks end to end. It also has to add directory
information to txt.tar, so it's not unusual (in fact it's common) for
a tar archive to be larger than the sum of its parts.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
$ tar cvf txt.tar *.txt
a minutes.txt 1 tape block
a note.txt 3 tape blocks
$ ls -ls
total 14
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
8 -rw-r--r-- 1 mjb group 4096 Feb 05 01:40 txt.tar
$
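Assuming the classic tar format, the 4096-byte figure in the listing can be reproduced exactly: each archive member gets a 512-byte header block plus its data rounded up to whole 512-byte blocks, and the archive ends with at least two 512-byte zero blocks (real tar implementations may pad further, out to their blocking factor).

```shell
# Size of one tar member: a 512-byte header plus the file's data
# rounded up to whole 512-byte blocks.
member() { echo $(( 512 + (($1 + 511) / 512) * 512 )); }

# minutes.txt (3 bytes) + note.txt (1201 bytes) + two zero trailer blocks
echo $(( $(member 3) + $(member 1201) + 1024 ))   # prints 4096
```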
This makes things look grim. Fortunately, the empty space in those
blocks is filled with what the manual calls garbage, and in practice
this garbage usually takes the form of hex zeros (NULs). That makes
a tar archive an excellent candidate for compression.
Proceeding to the next logical step, the following listing
compresses the tar archive. The resulting file txt.tar.Z is
404 bytes long (1 allocation unit or 2 blocks). Then, by removing the
original text files, the directory contents are reduced to only 2
blocks, saving 66 percent of the space previously used.
$ ls -ls
total 14
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
8 -rw-r--r-- 1 mjb group 4096 Feb 05 01:40 txt.tar
$ compress txt.tar
$ ls -ls
total 8
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
2 -rw-r--r-- 1 mjb group 404 Feb 05 01:40 txt.tar.Z
$ rm *.txt
$ ls -ls
total 2
2 -rw-r--r-- 1 mjb group 404 Feb 05 01:40 txt.tar.Z
$
The following listings show you how to reverse the tar-and-compress
process. The tar key argument for extracting from an archive is x .
The other key arguments are the same as in the earlier tar command.
$ ls -ls
total 2
2 -rw-r--r-- 1 mjb group 404 Feb 05 01:40 txt.tar.Z
$ uncompress txt.tar
$ ls -ls
total 8
8 -rw-r--r-- 1 mjb group 4096 Feb 05 01:40 txt.tar
$ tar xvf txt.tar
x minutes.txt, 3 bytes, 1 tape block
x note.txt, 1201 bytes, 3 tape blocks
$ ls -ls
total 14
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
8 -rw-r--r-- 1 mjb group 4096 Feb 05 01:40 txt.tar
$ rm txt.tar
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group 3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
So you can use tar and compress to save a lot of disk
space on files that are rarely used. The GNU gzip utility has
a few extra options that make it more efficient than compress, especially
on large files.
Once you've located stashes of files that are rarely used, but which
must remain available, you should archive and compress them. For
directories with many little files, tar them and then compress or gzip
the archive. For directories with a few large files, compress or gzip
the files directly. You may tar them first if you like, but it will
probably make little difference in space used, though it might make
things easier to administer.
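As a sketch of the gzip variant of the workflow (oldnotes is a hypothetical directory of little-used files; the first line just creates stand-in files for the demo):

```shell
# Pack a directory of small, rarely used files into one archive,
# compress it, and remove the originals. 'oldnotes' is a hypothetical
# directory name; the first line creates sample data for the demo.
mkdir -p oldnotes && echo "sample" > oldnotes/sample.txt

tar cvf oldnotes.tar oldnotes    # combine the little files
gzip oldnotes.tar                # produces oldnotes.tar.gz
rm -r oldnotes                   # remove the originals

# ...later, to restore:
gunzip oldnotes.tar.gz
tar xvf oldnotes.tar
rm oldnotes.tar
```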
Good luck squeezing more space out of those disk drives.