Search and replace with vi -- part 2

What's in an expression? Mastering the substitute regular expressions

Summary
Last month we showed you the basics of vi's search and replace features. This month we cover the search string itself, and the regular expressions that make it possible to create complex search and replace commands. To become a vi master you need to understand regular expressions. (3,500 words)
Thanks to those of you who caught the typo in one of last month's code snippets. You go to the head of the class! If you read last month's column early in the month, you may want to see what it was.

What is an expression? If I say, "A bird in the hand is worth two in the bush," I don't mean that you must run out and trap two bush birds so that you can swap them for the one I am carrying around in my hand. It is an expression that stands for something else: a symbolic representation of another concept.

In a substitute command, "bat" as a search text stands for the three characters b, a, and t appearing one after the other. However, "[b-dh]at" as a search text does not stand for the eight characters left bracket, b, hyphen, d, h, right bracket, a, and t. Instead it is a regular expression that stands for something else. We will get to what it means in just a moment, but you have to approach it in simple steps.

The vi editor (actually the "ex" editor mentioned in last month's article) allows regular expressions to be created by setting aside certain standard characters and allowing them to have special meanings over and above the characters that they normally represent.

Before you start these examples type the following ex command starting with a colon and press ENTER.

:set magic

You may set magic or nomagic using the set command. vi special characters behave differently depending on the setting. The description below is in magic mode, which is the usual default for vi. However vi can be set up to use nomagic as default when it first starts. Typing set magic ensures that you are in magic mode. I will explain the effect of nomagic after we have had a look at the basic descriptions of regular expression special characters.

The simplest special character is the dot or period (.) which stands for any single character. The following command searches for h, followed by any character, followed by t and replaces it with host.

:%s/h.t/host/g

This command applied to Listing 1 produces Listing 2. Note that "h.t" has matched "hat" and "hut" as well as the "hat" in "hatter's" and "That" and the "h t" in "Bach to".

Listing 1

That hatter's magic hat
led Bach to the hut.

Listing 2

Thost hostter's magic host
led Bachosto the host.
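The same substitution can be tried outside the editor. The following is a minimal sketch that uses Python's re module as a stand-in for vi; the variable names are illustrative only, and Python's regular expression syntax is not identical to vi's, although the dot behaves the same way in this case.

```python
import re

# Listing 1, held as a two-line string.
listing_1 = "That hatter's magic hat\nled Bach to the hut."

# Equivalent in spirit to :%s/h.t/host/g
# The dot matches any single character, including a space.
listing_2 = re.sub(r"h.t", "host", listing_1)

print(listing_2)
```

Note how the space in "Bach to" is matched by the dot just like any letter would be.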

The next useful special characters are the caret (^) and the dollar sign ($), which stand for the beginning and end of a line. These two characters can be included in a search text to locate characters appearing at the beginning or end of a line, but they are not replaced by the replacement text. The following command searches for h, followed by any character, followed by t, but only at the end of a line, and replaces it with host.

:%s/h.t$/host/g

The command applied to Listing 1 would produce Listing 3. Only "hat" at the end of the first line has been replaced. Note that the end of the line itself has not been replaced, only the text at the end of the line. The beginning and end of the line indicate the position of a search text, but are unaffected by the replacement.

Listing 3

That hatter's magic host
led Bach to the hut.

When a caret is used to search for the beginning of a line, it is placed before the search text. The following command applied to Listing 1 would produce Listing 4. The search text, consisting of any character followed by an e followed by any character, could have matched "led", the "ter" in "hatter's", and the "he" at the end of "the" together with the space that follows it in the second line. Because the caret was used to limit the search to the beginning of a line, only "led" at the beginning of the second line has been matched and replaced.

:%s/^.e./brought/g

Listing 4

That hatter's magic hat
brought Bach to the hut.
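Both anchors can be sketched in Python as well. This is an illustration under one assumption worth stating: Python's ^ and $ only match at every line when the re.MULTILINE flag is given, whereas vi works line by line already. The variable names are made up for the example.

```python
import re

listing_1 = "That hatter's magic hat\nled Bach to the hut."

# Roughly :%s/h.t$/host/g -- only a match at the end of a line is replaced.
listing_3 = re.sub(r"h.t$", "host", listing_1, flags=re.MULTILINE)

# Roughly :%s/^.e./brought/g -- only a match at the start of a line is replaced.
listing_4 = re.sub(r"^.e.", "brought", listing_1, flags=re.MULTILINE)

print(listing_3)
print(listing_4)
```

In both cases the line boundary itself is untouched; only the text next to it changes.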

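The asterisk is a special character that is used to indicate zero or more occurrences of the previous character. Because "zero occurrences" also matches the empty string that sits between any two characters, a search for a single space followed by an asterisk would match everywhere. The following command therefore searches for a space followed by zero or more additional spaces (in other words, one or more spaces) and replaces them with a single space.

:%s/  */ /g

This command applied to Listing 5, a slightly different version of the tale of our magic hat, would produce Listing 6 by tightening up the extra spaces between words.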

Listing 5

That    hatter's magic hat
led    Bach to    the hut.

Listing 6

That hatter's magic hat
led Bach to the hut.
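A point worth checking by experiment is that a pattern able to match zero characters matches at every position. The sketch below, again using Python's re module as a stand-in rather than vi itself, compares a pattern that requires at least one space with one that does not. The names are illustrative.

```python
import re

listing_5 = "That    hatter's magic hat\nled    Bach to    the hut."

# A space followed by zero or more spaces: at least one space must be present.
listing_6 = re.sub(r"  *", " ", listing_5)

# Zero or more spaces on its own also matches the empty string,
# so a replacement is made even where there was no space at all.
padded = re.sub(r" *", " ", "hat")

print(listing_6)
print(repr(padded))
```

The second result is no longer "hat", which is why the pattern needs one literal space in front of the starred one.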

If you need to search for a period or a dollar sign or a caret, or any of the other special characters (there are more to come) then precede the character with a backslash (\). The backslash can be used to "take away" the special meaning of a special character. The following command searches for a period -- which is entered as backslash period (\.) -- followed by zero or more spaces ( *) and replaces any that are found with a period and a single space. The period is not a special character in the replacement string, only in the search string so there is no need to precede it with a backslash in the replacement string.

:%s/\. */. /g

The backslash is used to convert a special character into a standard character, so it is itself a special character. If you want to search for a backslash you must precede it with a backslash. The following command searches for a backslash and replaces it with a hyphen.

:%s/\\/-/g
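Escaping works the same way in most regular expression dialects. A minimal Python sketch of the two commands above follows; the sample strings are invented for the illustration.

```python
import re

sentence = "One.Two.   Three"

# Roughly :%s/\. */. /g -- a literal period, then zero or more spaces.
spaced = re.sub(r"\. *", ". ", sentence)

# Roughly :%s/\\/-/g -- a literal backslash becomes a hyphen.
path = "C:\\temp\\file"          # the text C:\temp\file
hyphenated = re.sub(r"\\", "-", path)

print(spaced)      # One. Two. Three
print(hyphenated)  # C:-temp-file
```

Because the period is escaped, it anchors the match, so the "zero or more spaces" that follow it cannot match on their own.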

The next useful special character that you will use in a regular expression is the character set. A character set is entered as two or more characters that are treated as a selection of characters to search for. The characters can be entered as a list of characters (e.g. [ace] meaning a or c or e) or they can be entered as a range of characters by entering two characters separated by a hyphen (e.g. [a-c] meaning a or b or c). The characters may also be entered as any combination of a list and a range, as in [a-cxz] meaning a or b or c (a through c) or x or z. Note that the character set is surrounded by left and right square brackets. The following examples match a single character that falls within the described set.

Expression     Represents a single character in the set
[afh]          a or f or h
[a-d]          a or b or c or d
[afhx-z]       a or f or h or x or y or z (x through z)

The regular expression that introduced this section can now be translated. The following regular expression, taken from the beginning of this article, will match bat, cat, dat, or hat.

[b-dh]at  =  b or c or d or h followed by "at"

A common use of the character set allows a search for an upper or lower case version of a letter. The following regular expression matches Rick or rick.

[Rr]ick  =  R or r followed by "ick"
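Character sets are written the same way in Python's re module, so the two expressions above can be checked directly. This is a small sketch with made-up sample text, not a vi session.

```python
import re

words = "bat cat dat eat fat hat"

# b, c, d or h followed by "at"; "eat" and "fat" are not matched.
matches = re.findall(r"[b-dh]at", words)

# R or r followed by "ick".
names = re.findall(r"[Rr]ick", "Rick and rick, but not Nick")

print(matches)  # ['bat', 'cat', 'dat', 'hat']
print(names)    # ['Rick', 'rick']
```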

Using these expressions for complex search and replace
Now you have the tools for a complex search and replace problem. Listing 7 is an example of the many different ways that "USA" has been typed into an address text file to identify the country of the address. A plan is afoot to search the file for duplicate names and addresses, but there are too many variations in address styles, "USA" being a single example. There would also be other problems with things such as apartment numbers, suite numbers, and so on. This example concentrates on the "USA" problem. To standardize it is decided that all versions will be converted to "USA" for the comparison.

Listing 7

USA
U S A
U.S.A
U. S. A.
usa
etc.

The following complex search and replace command will do the job.

:%s/[Uu]\.* *[Ss]\.* *[Aa]\.*/USA/g

Breaking this down it becomes: Search all lines for U or u, followed by zero or more periods, followed by zero or more spaces, followed by S or s, followed by zero or more periods, followed by zero or more spaces, followed by A or a, followed by zero or more periods. Replace it, when found, with "USA".

Listing 8 shows the separate elements of the search string.

Listing 8

[Uu]      =    U or u
\.*       =    Zero or more periods
 *        =    Zero or more spaces
[Ss]      =    S or s
\.*       =    Zero or more periods
 *        =    Zero or more spaces
[Aa]      =    A or a
\.*       =    Zero or more periods
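The pattern can be tried against every variant in Listing 7. The sketch below uses Python's re module, where this particular expression happens to be written identically. The list name and the extra test word are assumptions added for illustration.

```python
import re

pattern = r"[Uu]\.* *[Ss]\.* *[Aa]\.*"

variants = ["USA", "U S A", "U.S.A", "U. S. A.", "usa"]
normalized = [re.sub(pattern, "USA", v) for v in variants]
print(normalized)  # ['USA', 'USA', 'USA', 'USA', 'USA']

# Caution: the pattern is not limited to whole words, so it also
# matches the letters u, s, a inside a longer word.
side_effect = re.sub(pattern, "USA", "thousand")
print(side_effect)  # thoUSAnd
```

The second result shows why such a command is best run on a field that is known to hold only the country, or combined with a whole-word match.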

You can achieve similar results by setting the ignorecase option, abbreviated as ic. If you type the ex command (starting with a colon) shown below, then the search string becomes case insensitive.

:set ic

Once this option is set, the following command performs the same search and replace, because the search string is now case insensitive.

:%s/u\.* *s\.* *a\.*/USA/g

To change back to case sensitive set noignorecase, abbreviated as noic, with the command below.

:set noic

Even if ignorecase is set, the [Uu] style of selecting upper or lower case works.
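Most regular expression libraries offer a similar switch. In Python's re module the rough counterpart of ignorecase is the re.IGNORECASE flag. The sketch below assumes that equivalence for illustration only.

```python
import re

variants = ["USA", "U S A", "U.S.A", "U. S. A.", "usa"]

# Lower case pattern, made case insensitive by a flag (compare :set ic).
by_flag = [re.sub(r"u\.* *s\.* *a\.*", "USA", v, flags=re.IGNORECASE)
           for v in variants]

# The [Uu] style still works when the flag is on.
by_set = [re.sub(r"[Uu]\.* *[Ss]\.* *[Aa]\.*", "USA", v, flags=re.IGNORECASE)
          for v in variants]

print(by_flag == by_set)  # True
```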

The value of the characters in a set can be reversed by including a caret as the first character of the set. The expression [^0-9] searches for any character that is not 0 through 9. The caret must be included as the first character in the set or it loses its inverting function. The expression [0-9^] searches for any character that is 0 through 9 or a caret.
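The effect of the caret's position is easy to demonstrate. The following sketch uses Python's re module, which treats the caret inside brackets the same way; the sample strings are invented.

```python
import re

# Caret first: every character that is NOT a digit is removed.
digits_only = re.sub(r"[^0-9]", "", "(555) 123-4567")

# Caret last: it is just another member of the set.
found = re.findall(r"[0-9^]", "2^10")

print(digits_only)  # 5551234567
print(found)        # ['2', '^', '1', '0']
```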

Inside the brackets of a set, most special characters lose their special meaning. In POSIX regular expressions, and in most versions of vi, the expression [.?!*] searches for a period or a question mark or an exclamation point or an asterisk; the dot does not mean "any character" there, and no backslash is needed in front of it. The characters that need care inside a set are those that have a meaning within the set itself: the caret in first position, the hyphen between two characters, and the right bracket.
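How a given tool treats special characters inside brackets is best confirmed by experiment. As one data point, the sketch below shows the behavior of Python's re module, where a dot or an asterisk inside a set is an ordinary character. Your version of vi may differ, so treat this as an illustration rather than a statement about vi.

```python
import re

text = "Wait. What?! A*B ok"

# Inside brackets the dot and asterisk are literal in Python's re.
marks = re.findall(r"[.?!*]", text)

print(marks)  # ['.', '?', '!', '*']
```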

The tilde (~) is another special character used in vi search strings. You will recall that an empty search string defaults to the previous search string used in a search command. The tilde stands for the previous replacement string used in a replacement command. The following commands search for "lft" and replace it with "left", then reverse the effect by searching for "left" and replacing it with "lft". The tilde in the second command is used to stand in for the first replacement text.

:%s/lft/left/g
:%s/~/lft/g

A more likely use of this special character would be to correct replacement errors. In the following two commands, the intention was to replace "lft" with "left" but "left" was incorrectly typed as "leff". The second command corrects the error by replacing "leff" with "left".

:%s/lft/leff/g
:%s/~/left/g
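Python's re module has no counterpart to the tilde, but the idea can be imitated by remembering the last replacement text in a variable. The sketch below is only an emulation; the variable names are hypothetical.

```python
import re

text = "turn lft at the light"

# First command: the replacement is mistyped as "leff".
previous_replacement = "leff"
text = re.sub(r"lft", previous_replacement, text)

# Second command: search for whatever was last used as a replacement,
# which is the role ~ plays in the vi search string.
fixed = re.sub(re.escape(previous_replacement), "left", text)

print(fixed)  # turn left at the light
```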

Understanding magic
The special characters that I have just covered are the ones used most frequently in regular expressions. One caution: the backslash does not always cancel the special meaning of a character inside left and right brackets. The expression [a\-c], which looks like it should mean a or hyphen or c, causes a Re internal error. This error means that re, the regular expression parser, can't understand what to do with the expression.

The backslash will "take away" a special character's status when magic is set on (:set magic). When magic is set off (:set nomagic), the special value of all characters is removed except for ^ at the beginning of a regular expression, $ at the end, and the backslash character itself. In order to make a character special again, a backslash must be placed in front of it. For example, the asterisk (*), which means zero or more repetitions of the preceding character, loses that meaning when nomagic is set. To search for one or more spaces (a space followed by zero or more spaces) and replace them with one space, you would use:

:%s/  \*/ /g

Compare that to the same search and replace with magic set:

:%s/  */ /g

Another useful pair of special characters is created by combining two characters.

\<        =      Match only at the beginning of a word
\>        =      Match only at the end of a word

This pair of combination characters remains the same regardless of magic or nomagic settings. The following command searches for "wed" only as a whole word and replaces it with "married".

:%s/\<wed\>/married/g

This prevents the search string from matching the "wed" in "wedding" or "awed".
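Whole-word matching exists in most regular expression dialects, though the spelling varies. Python's re module does not use \< and \>; it uses \b for a word boundary at either end. The sketch below assumes that substitution and uses an invented sentence.

```python
import re

sentence = "They wed after the wedding and were awed."

# \bwed\b plays the role of \<wed\> in vi.
result = re.sub(r"\bwed\b", "married", sentence)

print(result)  # They married after the wedding and were awed.
```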

One final note on search strings. Certain combinations built from character sets are used so often that it becomes almost natural to think of them as special search characters in their own right. For example, [0-9] represents any character 0 through 9, which is easier to think of as meaning any digit. Likewise [0-9]* becomes zero or more digits and [0-9][0-9]* becomes one or more digits (one digit followed by zero or more digits). Listing 9 includes some of the combinations that you might become used to recognizing in a search pattern.

Listing 9

Expression        Represents              Comments
[0-9]             Any digit
[^0-9]            Any non-digit
[0-9]*            Zero or more digits
[0-9][0-9]*       One or more digits
[a-z]             Any lower case letter
[A-Z]             Any upper case letter
[a-zA-Z]          Any letter
[^a-zA-Z]         Any non-letter
[a-zA-Z0-9]       Any letter or digit
[a-zA-Z][a-z]*    A word                  Any letter followed by zero or more lower case letters.
[a-z][a-z]*       A lower case word       One or more lower case letters.
[ ][ ]*           White space             One or more spaces or tabs. Each pair of brackets contains a space character and a tab character. Both characters are invisible, but the expression means a space or a tab followed by zero or more spaces or tabs.
[^a-zA-Z0-9]      Punctuation             This set also contains an invisible space and tab, and means any character that is not a letter, a digit, or white space.
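A few of these combinations can be exercised directly. The sketch below uses Python's re module, in which these particular expressions are written the same way; the tab is written as \t so that it is visible, and the sample strings are invented.

```python
import re

sample = "Bach heard 10 sponsors and 2134678 listeners"

# One or more digits.
numbers = re.findall(r"[0-9][0-9]*", sample)

# A word: any letter followed by zero or more lower case letters.
words = re.findall(r"[a-zA-Z][a-z]*", "Bach to the hut")

# White space: a space or tab followed by zero or more spaces or tabs.
fields = re.split(r"[ \t][ \t]*", "led \t Bach\tto")

print(numbers)  # ['10', '2134678']
print(words)    # ['Bach', 'to', 'the', 'hut']
print(fields)   # ['led', 'Bach', 'to']
```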

Advanced search and destroy: Saving strings
Regular expressions in replacement strings are fairly simple compared to search strings, but they have their own special rules.

The simplest special character to use in a replacement is the ampersand (& for magic or \& for nomagic) which stands for the string just found by the search. To illustrate the use of the ampersand look at Listing 10. This is some sort of shopping list with prices. Because of the international nature of this shopping list it is necessary to add a symbol for the currency in which the prices are given, which in this case happens to be the Mexican peso which uses the dollar sign symbol.

What is needed is a search and replace command that will locate the prices and insert a leading "$".

Listing 10

beans                    19.95
peas                      5.17
potatoes                 12.00
carrots                  13.17

The following command will search for a digit, followed by zero or more digits, followed by a decimal point, followed by zero or more digits. Whatever is found is replaced by a dollar sign followed by whatever string was found.

:%s/[0-9][0-9]*\.[0-9]*/$&/g

The effect of running this command on Listing 10 is shown in Listing 11. The search portion of the substitute command locates "19.95" in the first line and replaces it with "$" followed by what it just found, "19.95".

Listing 11

beans                    $19.95
peas                      $5.17
potatoes                 $12.00
carrots                  $13.17
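The "whole match" idea carries over to other tools under a different spelling. In Python's re module the counterpart of vi's & is \g<0> in the replacement text. The sketch below assumes that mapping and uses a shortened price list.

```python
import re

price_list = "beans 19.95\npeas 5.17\npotatoes 12.00"

# Roughly :%s/[0-9][0-9]*\.[0-9]*/$&/g
# \g<0> stands for the text that the whole pattern matched.
with_currency = re.sub(r"[0-9][0-9]*\.[0-9]*", r"$\g<0>", price_list)

print(with_currency)
```

As in vi, the dollar sign has no special meaning in the replacement text, so it is inserted as typed.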

This is perhaps one of the most powerful features of a vi search and replace (substitute) command: the ability to execute a regular expression search and save whatever string was matched by the search pattern so that the string can be used in the replacement text.

In fact the vi substitute command allows for even more granularity. It is possible to search for a string and use any portion of the found string in the replacement text. A search string can be marked with \( and \) to indicate text that is to be saved for use in the replacement string. This is a double character combination similar to the start and end of a word ("\<" and "\>") syntax used in a search string. The mark is created by using two characters to start and two characters to end the mark. This two character marking scheme is the same whether in magic or nomagic mode.

The string or strings that have been marked can be used in the replacement string by inserting \1, \2 and so on into the replacement string. The \1 stands for the first marked text, \2 stands for the second marked text.

A couple of examples will illustrate this more quickly than trying to explain it.

In the example in Listing 12, text about the results of a survey contains exact numbers.

While the report is precise, it is not a very comfortable read with all those long numbers. It would be better to present the results with fewer numbers and more English.

Listing 12

The population of the city is 14,493,122. Of these
5,217,640 responded to the survey. No less than 1,123,456
admit to being regular listeners. Most of the regular
listeners could identify 10 or more of the sponsors.
There were 2,134,678 occasional listeners and none of
Them could identify any of the sponsors.

In this case each of the numbers in the millions is to be shortened so that 14,493,122 is changed to read "about 14 million".

The substitute command to do this is shown below.

:%s/\([0-9][0-9]*\),[,0-9]*/about \1 million/g

An analysis of the search and replace string makes it easier to follow.

:%s/                    In all lines search for
\([0-9][0-9]*\)         1 or more digits and mark them
,                       followed by a comma
[,0-9]*                 followed by 0 or more commas or digits
/about                  replace what is found with "about "
\1                      followed by the first marked text
 million/               followed by " million"
g                       do it globally

Listing 13 shows the results of this substitution.

Listing 13

The population of the city is about 14 million. Of these
about 5 million responded to the survey. No less than about 1 million
admit to being regular listeners. Most of the regular
listeners could identify 10 or more of the sponsors.
There were about 2 million occasional listeners and none of
Them could identify any of the sponsors.
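The marked-text substitution can be reproduced as a sketch in Python's re module. One difference to keep in mind: Python marks text with plain ( and ) rather than \( and \), while \1 in the replacement means the same thing. The sample line is taken from Listing 12.

```python
import re

line = "The population of the city is 14,493,122. Of these"

# Roughly :%s/\([0-9][0-9]*\),[,0-9]*/about \1 million/g
# Group 1 holds the digits before the first comma.
rounded_down = re.sub(r"([0-9][0-9]*),[,0-9]*", r"about \1 million", line)

# A number with no comma, such as 10, is left alone.
untouched = re.sub(r"([0-9][0-9]*),[,0-9]*", r"about \1 million",
                   "could identify 10 or more")

print(rounded_down)
print(untouched)
```

The pattern relies on every number with a comma being in the millions, which is true of Listing 12 but would not hold for a figure such as 1,234.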

This substitution makes for a better read, but a bit too much detail is lost. More detail can be retained by using the following substitution command:

:%s/\([0-9][0-9]*\),\([0-9]\)[,0-9]*/\1.\2 million/g

An analysis of this search and replace string is also useful.

:%s/                    In all lines search for
\([0-9][0-9]*\)         1 or more digits and mark them
,                       followed by a comma
\([0-9]\)               followed by 1 digit and mark it
[,0-9]*                 followed by 0 or more commas or digits
/\1                     replace what is found with the first marked text
.                       followed by a dot
\2                      followed by the second marked text
 million/               followed by "million"
g                       do it globally

Applying this substitute command, the result will be Listing 14.

Listing 14

The population of the city is 14.4 million. Of these
5.2 million responded to the survey. No less than 1.1 million
admit to being regular listeners. Most of the regular
listeners could identify 10 or more of the sponsors.
There were 2.1 million occasional listeners and none of
Them could identify any of the sponsors.
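With two marked portions the substitution keeps the first digit after the comma, which truncates rather than rounds: 14,493,122 becomes 14.4, not 14.5. The sketch below shows both behaviors in Python's re module. The helper to_millions is a hypothetical function written for this illustration and has no counterpart in vi.

```python
import re

line = "The population of the city is 14,493,122. Of these"

# Roughly :%s/\([0-9][0-9]*\),\([0-9]\)[,0-9]*/\1.\2 million/g
truncated = re.sub(r"([0-9][0-9]*),([0-9])[,0-9]*", r"\1.\2 million", line)

def to_millions(match):
    # Strip the commas, convert to an integer, and round to one decimal place.
    value = int(match.group(0).replace(",", ""))
    return f"{value / 1_000_000:.1f} million"

# Match only numbers of the form n,nnn,nnn and let a function build the text.
rounded = re.sub(r"[0-9][0-9]*,[0-9][0-9][0-9],[0-9][0-9][0-9]",
                 to_millions, line)

print(truncated)  # ... is 14.4 million. Of these
print(rounded)    # ... is 14.5 million. Of these
```

For a readable summary the truncated figure is usually close enough, but the difference is worth knowing when the numbers matter.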

This text is still readable, but retains more of the accuracy of the original.

The substitute command in vi is very powerful, but it takes some practice to get used to it. This article has provided fairly thorough coverage of the substitute command. The substitute regular expressions that you have seen here are a subset of the regular expressions that can be used in sed, the stream editor, and grep and egrep, the search utilities. Learning those used in this article will help you with sed and grep. The search string regular expression options may also be used in a standard search command within vi and are not limited to search and replace.
