Go to the end of the line
Converting Word and MS-DOS documents to Unix
Summary
This month Mo explains how to move text documents from Word or DOS into a Unix environment. A couple of simple scripts can get rid of the annoying carets and such that appear when a Word document is opened in a Unix-based editor like vi. (2,400 words)
With the ability to move documents and text files so easily from one system to another, you would think that some common format could be
devised for information. Unfortunately, moving a Word document or a spreadsheet from a Windows environment to a Unix environment leaves
you with a document or spreadsheet that cannot be read or used by any application in the Unix world. There are exceptions -- but very few.
Ah, but what about the humble text file? Surely files containing nothing but ASCII characters must be portable across multiple systems. Well, that's
true ... mostly.
The gotcha in moving text files between systems is most apparent in moving text from an MS-DOS or Windows environment to a Unix
environment. This problem comes up often because these three operating systems are so common. Of course, you and I know that DOS is dead.
Just don't tell that to all those people running those thousands of vertical market DOS applications -- none of which were ever ported to Windows,
since their software vendors went out of business while trying to do the ports.
Aside from the need to move text files in general, you no doubt also have many general purpose programs, probably written in C, that you'd like to port from DOS to Unix or vice versa. ("Gee, Edwina, remember that utility that you wrote to unscramble the framis-gaggle? I bet you could move that code to Unix and it would compile and run there.") Yes, she probably could move it, but the text files that contain the source code for the framis-gaggle unscrambler will use a different end-of-line marker in the Unix world than they did under DOS or Windows, resulting in some tedious editing for poor Edwina. The same problem will arise for users who just want to move plain old text documents from platform to platform. Why should this be the case?
You have probably heard terms like carriage return, line feed, and newline bandied about in connection with text files and printing, but you might not know exactly what they are or how they relate to text files.
Back when dinosaurs ruled the earth, the primary method of outputting information from a computer was a printer or teletypewriter. (A vestigial memory of the latter piece of equipment is retained in the Unix designation for a terminal -- tty, an abbreviation for teletypewriter.) One of the important factors in controlling output to the printer or teletypewriter was the subject of carriage control. Printers and teletypewriters had a platen, or cylinder, which you can still see today on dot matrix and other impact printers. Paper was fed through the printer by rolling it around the platen, or, in the case of pin feed paper, by feeding it through a tractor feeder. The tractor feeder and the platen were both carriages, and their primary function was to carry the paper in a precision manner and position it in front of the print head. The mechanism that moved the print head back and forth was also considered part of the carriage; the carriage as a whole was responsible for the amount of space between each printed line, as well as the positioning of the print head before each line was printed.
Early printers attached to IBM's big iron expected to receive two (sometimes three) bytes of information at the start of each line from the computer; these bytes contained information on where to print the upcoming line of characters on the piece of paper. Carriage control commands ranged from the very simple (PRINT AFTER ADVANCING 1 LINE) to the complex (PRINT BEFORE ADVANCING 3 VERTICAL TABS).
Because carriage control allowed control over vertical movement on the printed page, two (now well known) font variations could be created. By printing a line and then printing it again without advancing a line, you could print the same characters in the same place twice, creating bold type. By printing a line and then printing a line of underscores and spaces without advancing a line, you could print a line containing underlining. You could also print a line or a character, backspace, and then print a hyphen or a slash for a strike through, although this was less common.
IBM printers were frequently fed by large bundles of wires that carried carriage control and printing signals separately. When smaller printers were developed that were connected by parallel printer or serial port connections, it became necessary to send bytes of printer control information as part of the data stream. Thus, the ASCII (American Standard Code for Information Interchange) character set includes control characters in addition to the standard letters and numbers. A control character is a single character that can be sent to a computer device, such as a printer or monitor, that controls the behavior of that device, rather than printing an actual character.
ASCII uses 128 numbers to represent all the uppercase and lowercase characters of the alphabet, the digits, the punctuation characters, and these special characters that are used to control printers, terminals, and other computer devices. The 128 values are numbered beginning with zero, so the numbers used range from 0 through 127. All the printable characters (letters, digits, and punctuation) have values between 32 and 126. The values 0 through 31 and 127 are used for control characters.
The table below is a brief ASCII chart with the decimal value of each entry and its ASCII name or character. Several of the ASCII codes represent the nonprintable characters, and these are given with their names. You might already be familiar with some of these.
Dec  Char  Dec  Char  Dec  Char  Dec  Char
0    NUL   32   SP    64   @     96   `
1    SOH   33   !     65   A     97   a
2    STX   34   "     66   B     98   b
3    ETX   35   #     67   C     99   c
4    EOT   36   $     68   D     100  d
5    ENQ   37   %     69   E     101  e
6    ACK   38   &     70   F     102  f
7    BEL   39   '     71   G     103  g
8    BS    40   (     72   H     104  h
9    HT    41   )     73   I     105  i
10   LF    42   *     74   J     106  j
11   VT    43   +     75   K     107  k
12   FF    44   ,     76   L     108  l
13   CR    45   -     77   M     109  m
14   SO    46   .     78   N     110  n
15   SI    47   /     79   O     111  o
16   DLE   48   0     80   P     112  p
17   DC1   49   1     81   Q     113  q
18   DC2   50   2     82   R     114  r
19   DC3   51   3     83   S     115  s
20   DC4   52   4     84   T     116  t
21   NAK   53   5     85   U     117  u
22   SYN   54   6     86   V     118  v
23   ETB   55   7     87   W     119  w
24   CAN   56   8     88   X     120  x
25   EM    57   9     89   Y     121  y
26   SUB   58   :     90   Z     122  z
27   ESC   59   ;     91   [     123  {
28   FS    60   <     92   \     124  |
29   GS    61   =     93   ]     125  }
30   RS    62   >     94   ^     126  ~
31   US    63   ?     95   _     127  DEL
ASCII chart with decimal values
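If you would like to check a few of these values on your own system, something along the following lines should do it. This is only a sketch: the function name ascii_values is made up for illustration, and it assumes your od accepts the -An and -tu1 options that POSIX describes.

```shell
# Print the decimal ASCII value of every byte read from standard input.
# od -An drops the offset column; -tu1 prints each byte as an unsigned decimal.
# tr and sed squeeze od's padded columns into a single space-separated line.
ascii_values() {
    od -An -tu1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# The letter A, a tab, the letter z, a carriage return, and a line feed.
printf 'A\tz\r\n' | ascii_values
# prints: 65 9 122 13 10
```

The tab, carriage return, and line feed come back as 9, 13, and 10, matching the HT, CR, and LF entries in the chart.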
Most of the nonprintable characters were and are used for communications protocols and have no real use for most applications programmers today. Even the others seem primitive in today's world of slick GUIs and laser printers. For example, the value 13 (CR) is a carriage return. When this value is sent to a printer, it causes the print head to return to column 1. A CR is also sometimes sent by the Return or Enter key on the keyboard, although Unix usually translates this to value 10, a line feed (LF). This latter control character advances the paper in a printer, or the cursor on a terminal, by one line. Value 7 (BEL), when sent by the computer to the terminal, usually causes a beep or rings a bell. HT (horizontal tab, or just plain tab), value 9, is sent to a printer or a screen and causes the cursor or print head to advance to the next tab stop. SO (shift out) and SI (shift in), values 14 and 15, are also used in printer control. Many printers are set up with two built-in fonts. Sending an SO causes the printer to shift to the second font, while an SI causes it to shift back to the original typeface.
The values from 33 through 126 are printable characters. Value 32 (SP) is a space. Whether a space is actually a printable character is a debatable point, since a space does not usually put ink on the paper. Instead, it places a character containing no image. Some printers render this by simply advancing the print head one position.
The characters in the range below 32 are used extensively in telecommunications. For example, 2 and 3, STX (start of text) and ETX (end of text), are often used at the start and end of a block of transmitted information, respectively. 6 and 21, ACK and NAK, are often used by a receiving computer to signal an acknowledgement (ACK for well received) or a negative acknowledgement (NAK for not well received, please retransmit).
Control characters are also used inside text files to indicate the end of a line, and here is where our problem lies. Unix uses a single LF (line feed, ASCII value 10) character to designate this. DOS and Windows use a combination of CR (carriage return, value 13) followed by LF. Moving a text file back and forth between these two systems without translating the end-of-line marker causes some unusual results. For example, the MS-DOS Edit utility is smart enough to recognize a file that only has a line feed for an end-of-line marker and displays it correctly, but the Windows Notepad utility is not. Notepad displays an untranslatable control character as a thick black vertical bar that looks like a black box. In the following listings this black box is shown as a pair of square brackets (like this: []).
A Unix text file in MS-DOS Edit
These are the times
that try men's souls.
The Metropolitan Transit Authority,
better known as the MTA
etc.
Notepad cannot figure out where the lines in the same file end.
A Unix text file in Notepad
These are the times[]that try men's souls.[]The Metropolitan Transit Authority,[]better known as the MTA[]etc.
In the reverse case, a Windows text file has too many control characters for vi. The extra carriage return shows up as a control-M (^M) in the vi display.
A Windows/DOS text file in vi
These are the times^M
that try men's souls.^M
The Metropolitan Transit Authority,^M
better known as the MTA^M
etc.^M
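You do not have to open a file in vi to find out whether it carries the extra carriage returns. A minimal sketch, assuming an awk that understands \r inside a regular expression (the common ones do), and using a function name, count_cr_lines, that I have invented for the purpose:

```shell
# Count the lines on standard input that end in a carriage return.
# A result of 0 suggests Unix line endings; a result equal to the
# number of lines in the file suggests DOS/Windows line endings.
count_cr_lines() {
    awk '/\r$/ { n++ } END { print n + 0 }'
}

# Two DOS-style lines followed by one Unix-style line.
printf 'one\r\ntwo\r\nthree\n' | count_cr_lines
# prints: 2
```

To check a real file, redirect it into the function, as in count_cr_lines <dos.txt.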
Many Unix/Windows transfer utilities include a switch that can be set to indicate that a text file is being transferred, and the resulting file has its end-of-line character(s) translated. Some utilities have text translation as the default, and you must set a switch to suppress the translation when you are transferring binary files.
Unfortunately, most serious movement of files in volume is done by combining and compacting the files using one of the versions of zip, tar, or what have you, and the resulting file must be transferred as a binary file. The individual files within such an archive do not have their end-of-line characters translated when they are combined and transferred as a binary.
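One way to cope is to unpack the archive first and translate the text files afterward. The sketch below is only an illustration under some loud assumptions: the function name convert_tree is hypothetical, it treats every file ending in .txt as text (adjust the pattern to suit your own files), and it does not handle file names that contain newlines.

```shell
# Strip carriage returns from every *.txt file below the given directory.
# Each file is rewritten through a temporary copy and then replaced.
# Files that do not match the pattern are left untouched.
convert_tree() {
    find "$1" -type f -name '*.txt' | while IFS= read -r f
    do
        tr -d '\r' <"$f" >"$f.tmp" && mv "$f.tmp" "$f"
    done
}

# Example (not run here): convert_tree ./unpacked
```

Keeping the pattern narrow matters, because running the same translation over a binary file would corrupt it.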
Conversion scripts
The extra carriage return can be removed or inserted with two simple scripts. I use scripts here so that you can save and reuse them. The first one takes two command-line arguments, the Unix file name and the DOS/Windows file name. It adds a carriage return to the end of each line of a Unix text file and writes the result under the DOS text file name so that it can be transferred to DOS. To enter this with vi, when you get to the ^M, type control-V then control-M. The control-V causes the next character to be inserted as a literal control character. After you have saved it as lf2crlf, change its mode to allow execution (chmod a+x lf2crlf).
#!/bin/sh
# lf2crlf
# adds a carriage return to each line of a unix
# text file so that end of line matches
# the Windows/DOS convention
usage()
{
    echo "usage: lf2crlf unix.txt dos.txt"
    exit 1
}
if [ $# -ne 2 ]
then
    usage
fi
sed 's/$/^M/' <"$1" >"$2"
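The script above depends on a literal control character typed into the file, which is easy to lose when the script is copied or mailed around. If that bothers you, here is a sketch of the same conversion that spells the carriage return as \r and lets awk produce it. The function name lf_to_crlf is my own, and the sketch assumes the input really has Unix line endings; run it on a file that already has CR/LF and you will get two carriage returns per line.

```shell
# Read Unix text on standard input and write it with CR/LF line endings.
# awk strips the LF from each line; printf puts back CR followed by LF.
lf_to_crlf() {
    awk '{ printf "%s\r\n", $0 }'
}

# Show the result byte by byte so the added \r is visible.
printf 'one\ntwo\n' | lf_to_crlf | od -c
```

Used on files, it reads lf_to_crlf <unix.txt >dos.txt, which mirrors the two arguments of the script.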
For text files coming from Windows or DOS to Unix, the second script strips the extra CR from the end of each line. There is an additional hook in MS-DOS files. Some DOS editors and utilities append a control-Z (value 26 or SUB in ASCII) to the end of a text file, which will display in vi as a ^Z. This script also removes that character. Note that the single quoted portion starts on one line and ends on the second. Use control-V, control-M to create the ^M and control-V, control-Z to create the ^Z.
#!/bin/sh
# crlf2lf
# removes the extra carriage return in a dos/windows
# text file so that end of line matches
# the Unix convention.
# Also removes a control-Z at end of file
usage()
{
    echo "usage: crlf2lf dos.txt unix.txt"
    exit 1
}
if [ $# -ne 2 ]
then
    usage
fi
sed 's/^M//g
s/^Z//g' <"$1" >"$2"
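The same cleanup can be sketched without any literal control characters by using tr, which accepts \r for a carriage return and the octal escape \032 for control-Z (decimal 26). The function name crlf_to_lf is invented for this example. Like the sed version, it deletes every CR and every control-Z it finds, not just those at the end of a line or the end of the file, so it is only suitable for plain text.

```shell
# Read DOS/Windows text on standard input and write Unix text:
# delete every carriage return (\r) and every control-Z (\032).
crlf_to_lf() {
    tr -d '\r\032'
}

# Two DOS lines with a trailing control-Z, shown byte by byte after cleanup.
printf 'one\r\ntwo\r\n\032' | crlf_to_lf | od -c
```

With files, it reads crlf_to_lf <dos.txt >unix.txt.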
Just as a final note: some DEC systems used only a carriage return to mark an end of line. I once ported a C application from Unix to DOS and then from Unix to VAX. The end-of-line terminators had to be handled to move the source code from one system to another.
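For a file that really does use a lone carriage return as its end-of-line marker, the translation is a one-for-one character swap. A minimal sketch, with a function name, cr_to_lf, that I made up, and on the assumption that the file contains no CR/LF pairs (each of those would turn into a blank line):

```shell
# Read text whose lines end in a lone CR and write it with LF line endings.
cr_to_lf() {
    tr '\r' '\n'
}

# Two CR-terminated lines, shown byte by byte after translation.
printf 'one\rtwo\r' | cr_to_lf | od -c
```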