Hands-off editing with sed, Part 2
The advanced basics of sed
Summary
In December's Unix 101 column, Mo Budlong covered some of the basics of the sed command, showing you how to use it to conduct global search-and-replace operations on a file. This month, he takes the basics a bit further by providing an overview of pattern buffers, holding buffers, and address rules. (3,000 words)
In December's Unix 101 column, I covered some of the basics of the sed command, explaining how it can be used for global search-and-replace actions on a file. If that were all you could do with sed, it would still be a useful tool. In fact, a version of grep, named gres, has recently been released that does exactly that. You can think of it as grep with replace, or sed with only the global search-and-replace option.
The sed utility goes way beyond search and replace, however, and you can use it to mangle a file beyond all recognition.
Before I launch you on a career as file mangler extraordinaire, you should understand what sed does when it processes a file.
At the simplest level, sed reads in a line of text, applies its first transformation rule to it, then the second, then the third, and so on, until all transformation rules are used up. The resulting changed line is then output. Here are some simple rules to remember when writing sed scripts:
- All editing commands in a script are applied, in order, to each line of input.
- Each command is applied to all the lines unless the line address limits the affected lines.
- Original files are unchanged. The output result is written to standard output.
The last rule is perhaps the most calming, because, to the new user, sed can seem scary and out of control, like an automated seek-and-destroy utility.
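The third rule is easy to check for yourself. The following is a minimal sketch, assuming a POSIX shell and a scratch directory; the file names input.txt and output.txt are invented for the illustration.

```shell
# Work in a scratch directory so no real file is touched.
workdir=$(mktemp -d)
printf 'one\ntwo\n' > "$workdir/input.txt"

# sed writes the edited text to standard output; redirect it to keep it.
sed 's/one/1/' "$workdir/input.txt" > "$workdir/output.txt"

original=$(cat "$workdir/input.txt")   # still the text we started with
edited=$(cat "$workdir/output.txt")    # the transformed copy
printf 'original:\n%s\nedited:\n%s\n' "$original" "$edited"

rm -r "$workdir"
```

The input file is read but never written; only the redirected output contains the change.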
Let's look at the process a little more closely. The sed utility maintains a pattern space or pattern buffer. When sed swings into action, the first line of text is read into this pattern buffer. The first line of a sed script is applied to the pattern buffer, and may or may not change the contents of the pattern buffer. The second line of script is then applied to the pattern buffer, whether or not the buffer was changed by the first line of the sed script. This is an important point to remember: each line of script is not applied to the original line of text, but to text modified by all of the previous script lines.
Assume that you have a text file in which you want to change every occurrence of ms-dos system or MS-DOS system to MS-DOS operating system. The following script works because the first line converts ms-dos to MS-DOS before the second line is applied to the pattern buffer.
s/ms-dos/MS-DOS/g
s/MS-DOS system/MS-DOS operating system/g
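To see the ordering at work, here is a small sketch that feeds two made-up sentences through both commands. The sample text is an assumption for the demonstration; the two substitutions are the ones described above, each with its closing slash and a g flag.

```shell
# Sample text (invented for this illustration) with both spellings.
input='Boot the ms-dos system first.
The MS-DOS system prompt appears.'

# The commands run in order on each line: the second one sees the
# result of the first, so both spellings end up expanded.
result=$(printf '%s\n' "$input" |
  sed -e 's/ms-dos/MS-DOS/g' -e 's/MS-DOS system/MS-DOS operating system/g')

printf '%s\n' "$result"
```

If the two commands were swapped, the lowercase spelling would be capitalized but never expanded, because the expansion would already have run.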
A sed command can be limited by one or more addresses. An address can be a physical record number in a file, or a regular expression identifying a line or range of lines. The following examples use the d (delete) command to illustrate some sample addressing. Note the difference between $ used as an address marking the last line of input, and /$/ used as the regular expression for an end-of-line character.
d (no address limit, deletes all lines)
1d (delete the first line)
10,15d (delete lines 10 through 15)
$d (delete the last line)
/Sacramento/d (delete any lines containing the word Sacramento)
/CUTHERE/,/TOHERE/d (delete all lines from CUTHERE through TOHERE, inclusive)
/^$/d (delete all blank lines; ^=line start, $=line end)
/STOPHERE/,$d (delete everything from STOPHERE to end of input)
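A few of these addresses can be tried on a five-line sample. This is only a sketch; the sample text is made up, and each command is run separately on the same input.

```shell
# Five lines of invented sample input; the third line is empty.
sample='first
second

STOPHERE
last'

no_first=$(printf '%s\n' "$sample" | sed '1d')               # line number address
no_last=$(printf '%s\n' "$sample" | sed '$d')                # $ as the last-line address
no_blank=$(printf '%s\n' "$sample" | sed '/^$/d')            # $ inside a regular expression
head_only=$(printf '%s\n' "$sample" | sed '/STOPHERE/,$d')   # regex through last line

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' \
  "$no_first" "$no_last" "$no_blank" "$head_only"
```

Note that the shell's command substitution strips trailing newlines, so the empty line left at the end of the last result does not show up in the variable.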
Now, let's say that you have attended an early run of Romeo and Ethel the Dancer, and the text of your review is ready for typesetting as listed below. The playwright calls you in a panic to inform you that he has opted for a simpler name -- Romeo and Juliet, of all things -- and you are stuck with a long and glowing review that you must rewrite. The sample below is the first of the 200 pages you produced after being a bit carried away by the performance. Note that a newline character terminates each line in the following text.
Romeo and Ethel the Dancer Moves Audience to Tears.
I was treated to the off Broadway opening of Romeo and Ethel the
Dancer. This moving story of star-crossed lovers had the
audience in tears half way through the third act.
Do not go to see this play without a hanky, but even the weeping
from the back row could not diminish the brilliance of Romeo and
Ethel the Dancer by William Shakespeare.
The first effort at a search-and-replace script would give you:
#romeo.sed
s/Romeo and Ethel the Dancer/Romeo and Juliet/g
If you review the text, you will see that this matches only the title of the article, because the other two occurrences of the name are split across line breaks. The result is this confusing output:
Romeo and Juliet Moves Audience to Tears.
I was treated to the off Broadway opening of Romeo and Ethel the
Dancer. This moving story of star-crossed lovers had the
audience in tears half way through the third act.
Do not go to see this play without a hanky, but even the weeping
from the back row could not diminish the brilliance of Romeo and
Ethel the Dancer by William Shakespeare.
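The failure is easy to reproduce on just two lines. In this sketch (the shortened sample text is an assumption), the search text never matches because sed looks at one line at a time, and neither line contains the whole phrase.

```shell
# The phrase is split across a line break, as in the review.
text='opening of Romeo and Ethel the
Dancer. This moving story'

# A one-line substitution cannot see across the break.
result=$(printf '%s\n' "$text" |
  sed 's/Romeo and Ethel the Dancer/Romeo and Juliet/g')

# The text comes out exactly as it went in.
if [ "$result" = "$text" ]; then
  echo "unchanged"
fi
```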
What is really needed here is some way of wrapping a search around line breaks. To do this effectively, we use a special syntax option of sed and a new command.
An address can be used to identify a range of lines over which multiple commands are to be executed. The syntax is as follows:
[address]{
command
command
etc.
}
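As a small illustration of this syntax, the following sketch applies two substitutions only to lines 2 and 3 of a three-line input. The input and the replacement words are invented for the example.

```shell
# Three identical lines; the group applies only to lines 2 through 3.
result=$(printf 'a cat\na cat\na cat\n' | sed '2,3{
s/a/the/
s/cat/dog/
}')

printf '%s\n' "$result"
```

The first line falls outside the address range and passes through untouched, while both commands inside the braces run on the other two.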
What we want is to execute a subroutine (or a sub-search-and-replace procedure) whenever the word Romeo is found in the text. The first command is still valid; the following would be the skeleton for the next subcommand:
#romeo.sed
s/Romeo and Ethel the Dancer/Romeo and Juliet/g
/Romeo/{
something here
}
The something here is really several steps:
- Read the next line into the pattern buffer
- Check whether the new pattern buffer contains Dancer
- If it does, eliminate the newline character
- Replace the old title with the new
The following listing includes line numbers to make the explanation easier, although a sed script does not normally contain line numbers. This listing will be duplicated later in this article without line numbers, as some positions in the script are important -- particularly the slashes on lines 8 and 10, which must appear as the first character on the line.
1 #romeo.sed
2 s/Romeo and Ethel the Dancer/Romeo and Juliet/
3 /Romeo/{
4 $!N
5 /Dancer/{
6 s/\n/ /
7 s/Romeo and Ethel the Dancer\. */Romeo and Juliet.\
8 /
9 s/Romeo and Ethel the Dancer */Romeo and Juliet\
10 /
11 }
12 }
Line 1 is a comment identifying the script. A sed script may begin with a comment line starting with a hash mark (#). In older versions of sed, comments are valid only on the first line, although an initial comment can be continued by ending the line with a backslash (\) and starting the next line with a hash mark. POSIX-conforming versions of sed accept a comment on any line.
Line 2 is the basic search-and-replace procedure that is being performed on the article. If the search text is contained entirely within one line, then this script line takes care of it.
Line 3 identifies any line containing Romeo as the address to which lines 4 through 12 must be applied.
Line 4 looks strange, but we'll break it down. The N command causes the next line to be pulled from the file and appended to the pattern buffer. The $! prefix adds a caveat: "Don't pull the next line if we're already on the last line." The sed utility has a little problem if you try to pull a line when you are already on the last line: it finds the end-of-file marker (that is, it notices that there are no more input lines) and just stops without outputting whatever is contained in the pattern buffer. The command at line 2 has already made sure that a match on a single line is taken care of, so we do not need to do a multiline match at the end of the file.
At line 5, an inner routine limits things even further. Having pulled another line, we only want to continue if the pattern buffer now also contains Dancer. If it does, lines 6 through 10 are executed. Line 6 replaces the newline character that was pulled in by N with a space.
When the next line is pulled into the pattern buffer, newline characters are left in the buffer. In order to get rid of the newline, you must replace it with a space -- if you don't, further search commands would have to take that newline into account. Replacing it with a space in this manner thus simplifies matters.
If you replace the newline with nothing at all, then this:
Romeo and Ethel the
Dancer
becomes this:
Romeo and Ethel theDancer
Forcing a space to replace the newline creates the correct spacing:
Romeo and Ethel the Dancer
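Both cases can be tried directly. The sketch below uses the two-line fragment from the text and runs N followed by each version of the newline substitution.

```shell
fragment='Romeo and Ethel the
Dancer'

# N appends the next line; replacing the embedded newline with a space
# keeps the words apart.
spaced=$(printf '%s\n' "$fragment" | sed '/Romeo/{
$!N
s/\n/ /
}')

# Replacing the newline with nothing runs the two words together.
glued=$(printf '%s\n' "$fragment" | sed '/Romeo/{
$!N
s/\n//
}')

printf '%s\n%s\n' "$spaced" "$glued"
```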
Lines 7 through 10 match two possible versions of the title: lines 7 and 8 match the title at the end of a sentence, while lines 9 and 10 match the title within a sentence.
Two separate replacement texts are needed because the intent of the search and replace is to combine two lines of text that contain the words Romeo and Ethel the Dancer, replace that phrase with Romeo and Juliet, and then reformat the result into two lines once again and output it. For example, the lines
I was treated to the off-Broadway opening of Romeo and Ethel
the Dancer. This moving story of star-crossed lovers
are combined in the pattern buffer and become this:
I was treated to the off-Broadway opening of Romeo and Ethel the Dancer. This moving story of star-crossed lovers
If the replacement action were simply to replace Romeo and Ethel the Dancer with Romeo and Juliet followed by a newline, the period at the end of Dancer in the above example would be dropped down to the second line, as follows:
I was treated to the off-Broadway opening of Romeo and Juliet
. This moving story of star-crossed lovers
By including two possible search texts, one with a period and one without, the newline can be placed after the replacement text or after a period at the end of the original search text.
If the title appeared followed by a comma anywhere in the original article, a third version of the search text would be needed to handle that condition. There is a syntax for searching for a string followed by any punctuation, but it is beyond the scope of this article.
The title is changed and output with a period and a newline, or with a newline only. Look at the eighth and tenth lines in the listing below (the line numbers are now removed to show the correct alignment of the slashes). The slash must be the first character on each of those lines. The complete replacement text includes everything up to that closing slash, including the embedded newlines that follow the backslashes at the ends of the seventh and ninth lines.
#romeo.sed
s/Romeo and Ethel the Dancer/Romeo and Juliet/
/Romeo/{
$!N
/Dancer/{
s/\n/ /
s/Romeo and Ethel the Dancer\. */Romeo and Juliet.\
/
s/Romeo and Ethel the Dancer */Romeo and Juliet\
/
}
}
This gives us the effect we wanted.
Romeo and Juliet Moves Audience to Tears.
I was treated to the off Broadway opening of Romeo and Juliet.
This moving story of star-crossed lovers had the
audience in tears half way through the third act.
Do not go to see this play without a hanky, but even the weeping
from the back row could not diminish the brilliance of Romeo and Juliet
by William Shakespeare.
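The script can be run from a file with sed -f. The following sketch makes one assumption worth stating loudly: the review has a blank line after the headline and between paragraphs. That matters because any line containing Romeo pulls in the line after it, so a headline followed directly by a line ending in "Romeo and Ethel the" would swallow that line before the multiline match could start. The file names are also just for the demonstration.

```shell
workdir=$(mktemp -d)

# The script exactly as listed, with the closing slashes in column one.
cat > "$workdir/romeo.sed" <<'EOF'
#romeo.sed
s/Romeo and Ethel the Dancer/Romeo and Juliet/
/Romeo/{
$!N
/Dancer/{
s/\n/ /
s/Romeo and Ethel the Dancer\. */Romeo and Juliet.\
/
s/Romeo and Ethel the Dancer */Romeo and Juliet\
/
}
}
EOF

# The review, assuming blank lines separate the headline and paragraphs.
cat > "$workdir/review.txt" <<'EOF'
Romeo and Ethel the Dancer Moves Audience to Tears.

I was treated to the off Broadway opening of Romeo and Ethel the
Dancer. This moving story of star-crossed lovers had the
audience in tears half way through the third act.

Do not go to see this play without a hanky, but even the weeping
from the back row could not diminish the brilliance of Romeo and
Ethel the Dancer by William Shakespeare.
EOF

result=$(sed -f "$workdir/romeo.sed" "$workdir/review.txt")
printf '%s\n' "$result"

rm -r "$workdir"
```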
Sometimes it's useful to read in the contents of a file rather than specifying the text in the sed script. The command r filename causes the contents of filename to be copied to the output after the current line has been processed. The file is not read into the pattern buffer, so later commands in the script cannot edit it. The following example inserts the standard letter closing stored in closing.txt after every line containing include closing:
/include closing/{
r closing.txt
}
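Here is a sketch of the r command in use. The letter text and the contents of closing.txt are invented; the second script is a variant, not from the listing above, that adds d so the marker line itself is dropped while the file contents are still written out.

```shell
workdir=$(mktemp -d)
printf 'Sincerely,\nThe Editor\n' > "$workdir/closing.txt"
printf 'Thank you for your letter.\ninclude closing\n' > "$workdir/letter.txt"

# The script from the text: append closing.txt after the marker line.
cat > "$workdir/closing.sed" <<'EOF'
/include closing/{
r closing.txt
}
EOF

# Variant (an addition for illustration): also delete the marker line.
cat > "$workdir/replace.sed" <<'EOF'
/include closing/{
r closing.txt
d
}
EOF

# r resolves closing.txt relative to the current directory.
appended=$(cd "$workdir" && sed -f closing.sed letter.txt)
replaced=$(cd "$workdir" && sed -f replace.sed letter.txt)

printf '%s\n---\n%s\n' "$appended" "$replaced"
rm -r "$workdir"
```

The file name runs to the end of the line, which is why the closing brace sits on a line of its own.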
The sed utility also has a second buffer, called the hold buffer. The following commands add and retrieve data from the hold buffer:
h replaces the hold buffer with the pattern buffer
H appends the pattern buffer to the hold buffer
g replaces the pattern buffer with the hold buffer
G appends the hold buffer to the pattern buffer
x swaps the hold and pattern buffers
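A compact way to watch G and h cooperate is the well-known one-liner that prints a file in reverse line order. It is offered here only as a sketch of the hold buffer commands, not as part of the review example; the -n option suppresses the normal output so that only the final p prints.

```shell
# For every line except the first, G appends the hold buffer (all earlier
# lines, already reversed) to the current line; h then saves the result.
# On the last line, p prints the accumulated text.
reversed=$(printf 'one\ntwo\nthree\n' | sed -n '1!G;h;$p')

printf '%s\n' "$reversed"
```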
Let's say we need to interchange the two paragraphs of the sample text with which we have been working. For this example we'll add two lines to our Romeo and Juliet review.
Romeo and Juliet Moves Audience to Tears.
LAST
I was treated to the off Broadway opening of Romeo and Juliet.
This moving story of star-crossed lovers had the
audience in tears half way through the third act.
TOP
Do not go to see this play without a hanky, but even the weeping
from the back row could not diminish the brilliance of Romeo and Juliet
by William Shakespeare.
The following script collects the lines from the one containing LAST through the one containing TOP, and holds them back until the end of the input:
#romeo2.sed
/LAST/,/TOP/{
H
d
}
${
G
}
As each line in the range is read, sed appends it to the hold buffer and deletes it from the pattern buffer, so that it is not output. After the line containing TOP, sed resumes normal output. When sed reaches the last line of the file, it uses G to append the hold buffer to the pattern buffer, and the combined text is output. (The x command would not do here: swapping the buffers would move the last line of the file into the hold buffer, where it would never be printed.) The output needs a little editing to clean it up, but we have swapped the two paragraphs.
Romeo and Juliet Moves Audience to Tears.
Do not go to see this play without a hanky, but even the weeping
from the back row could not diminish the brilliance of Romeo and Juliet
by William Shakespeare.
LAST
I was treated to the off Broadway opening of Romeo and Juliet.
This moving story of star-crossed lovers had the
audience in tears half way through the third act.
TOP
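The swap can be reproduced with the sketch below. It uses G on the last line, so that the final line of input is printed and the held text follows it. The file names are assumptions for the demonstration. Because H adds a newline before each line it appends, the held text begins with an empty line, which is part of the cleanup still needed.

```shell
workdir=$(mktemp -d)

cat > "$workdir/romeo2.sed" <<'EOF'
#romeo2.sed
/LAST/,/TOP/{
H
d
}
${
G
}
EOF

cat > "$workdir/review.txt" <<'EOF'
Romeo and Juliet Moves Audience to Tears.
LAST
I was treated to the off Broadway opening of Romeo and Juliet.
This moving story of star-crossed lovers had the
audience in tears half way through the third act.
TOP
Do not go to see this play without a hanky, but even the weeping
from the back row could not diminish the brilliance of Romeo and Juliet
by William Shakespeare.
EOF

result=$(sed -f "$workdir/romeo2.sed" "$workdir/review.txt")
printf '%s\n' "$result"

rm -r "$workdir"
```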
There is much more to sed, but these examples should give you a sense of some of its more complex operations.