Grep Notes

With help from Unix Hero Geraint Howells, MSUB author Anders Munch, and a variety of Usenet posters.

Recursive grep

There are a variety of ways of applying commands, such as grep, to all the files in a directory, including sub-directories. They are based on the xargs and find commands. The find command is especially useful and is well worth memorising.

Let's say we are looking for all mentions of a variable named “counter” in source code files, which have names ending in .c and .h. To search just the .c files:

find ./ -name '*.c' | xargs grep counter

This is equivalent to:

grep counter `find ./ -name '*.c'`

The following will print only the names of the files containing the search string (foo, in this and the remaining examples).

find ./ -name '*.c' | xargs grep -l foo
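One thing the xargs pipelines above do not handle is file names containing spaces, because xargs splits its input on whitespace. A minimal sketch of the usual workaround follows, assuming a find and xargs that support -print0 and -0 (GNU and BSD versions do, but older systems may not). The directory and file names are made up for the demonstration.

```shell
# Throwaway tree; every name below is hypothetical.
tmp=$(mktemp -d)
mkdir -p "$tmp/src/sub dir"
printf 'int counter = 0;\n' > "$tmp/src/a.c"
printf 'counter++;\n' > "$tmp/src/sub dir/b file.c"
printf 'nothing here\n' > "$tmp/src/c.c"

# -print0 and -0 separate names with NUL bytes instead of whitespace,
# so "sub dir/b file.c" reaches grep as a single argument.
matches=$(cd "$tmp" && find ./ -name '*.c' -print0 | xargs -0 grep -l counter | sort)
printf '%s\n' "$matches"

rm -rf "$tmp"
```

Without -print0 and -0, grep would be handed "sub", "dir/b" and "file.c" as three separate (and non-existent) files.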

Some alternative methods are:

find ./ -name '*.c' -exec grep foo {} \;
find ./ -name '*.[hc]' -exec grep -l foo {} \;
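find ./ \( -name '*.c' -o -name '*.h' \) -type f -exec grep foo {} \;

(The parentheses in the last command matter: -o binds less tightly than the implied "and" between the other tests, so without them -type f and -exec would apply only to the .h files.)

The -exec method tends to be slower because grep is executed once per file found. If the command to be executed (in this case grep) can take multiple files as arguments, the xargs or backquote methods are faster. This is because grep is executed only once with all the files found (or a handful of times, if xargs has to split a very long list) instead of once per file. The difference is most noticeable when recursing into a big directory structure, although on a very big one the backquote method can fail with an "argument list too long" error.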
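There is also a middle way that these notes did not originally cover: ending -exec with + instead of \; makes find batch the file names itself, much as xargs does. The sketch below assumes a find that supports the + form (it is in POSIX, but very old versions lack it) and uses made-up file names. The inner sh only exists to count how many times the command is started.

```shell
tmp=$(mktemp -d)
for n in 1 2 3; do printf 'foo %s\n' "$n" > "$tmp/f$n.c"; done

# With \; the command is started once per file found.
per_file=$(find "$tmp" -name '*.c' -exec sh -c 'echo run' sh {} \; | wc -l)

# With + find appends as many names as will fit, so it starts once here.
batched=$(find "$tmp" -name '*.c' -exec sh -c 'echo run' sh {} + | wc -l)

# The same batching applied to grep itself.
hits=$(find "$tmp" -name '*.c' -exec grep -l foo {} + | wc -l)

echo "per-file runs: $per_file, batched runs: $batched, files matched: $hits"
rm -rf "$tmp"
```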

The advantage of the -exec method is that -exec acts as a test: if the command returns 0 (for grep, this means a match was found) then any subsequent arguments are also evaluated. Using this fact, the following could be performed:

find ./ -name '*.c' -exec grep foo {} \; -print

In this case, if foo is found in a file, grep prints the matching lines and find then prints the name of the file.
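If only the file names are wanted, grep's -q option (quiet: no output, just the exit status) can be combined with the same trick. A small sketch, with hypothetical file names:

```shell
tmp=$(mktemp -d)
printf 'foo here\n' > "$tmp/hit.c"
printf 'bar here\n' > "$tmp/miss.c"

# -exec is true only when grep exits with 0, so -print runs only for
# files that contain foo; -q stops grep printing the matching lines.
found=$(find "$tmp" -name '*.c' -exec grep -q foo {} \; -print)
printf '%s\n' "$found"

rm -rf "$tmp"
```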

Finally, GNU grep (version 2.3 onwards) supports recursion directly, with the -r, --recursive, or --directories=recurse options.
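With recursive grep, the file name filtering that find did with -name has to be done by grep. The sketch below assumes a grep that has the --include option (GNU grep does in reasonably recent versions, but this is an assumption about your installation). The file names are made up.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/lib"
printf 'foo\n' > "$tmp/main.c"
printf 'foo\n' > "$tmp/lib/util.h"
printf 'foo\n' > "$tmp/notes.txt"

# -r walks the tree, -l lists file names only, and each --include
# restricts the search to names matching the given glob.
rec=$(grep -r -l --include='*.c' --include='*.h' foo "$tmp" | sort)
printf '%s\n' "$rec"

rm -rf "$tmp"
```

notes.txt also contains foo, but is skipped because it matches neither glob.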

Everything Except

If matching everything up to a specific character, use a negated set, i.e. [^expression], where the expression is the character that terminates the wanted string. The set matches any character except the terminator, so the match stops just before it. If the terminating character is to be retained, use context markers to exclude it from the replacement, or simply add it to the replace string.

For instance, '<P'[^'>']+'>' will match any P tag, regardless of the content between the P and the closing >.
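The same idea can be tried out with grep, which is handy for checking a pattern before committing it to a script. Note that this sketch uses grep's extended regular expression syntax, not MSUB's (no inner quotes are needed around literal text), and assumes a grep with the -o option, which prints only the matched part. The sample line is made up.

```shell
line='<P align="left">Some text</P> <B>bold</B>'

# [^>]+ matches one or more characters that are not >, so the match
# runs from <P up to the first > and no further.
tag=$(printf '%s\n' "$line" | grep -o -E '<P[^>]+>')
printf '%s\n' "$tag"
```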

Shortest and Longest Match

When using the shortest match, remember that multi-character expressions may not work as expected. For instance, if we use '<'/[^'>']+/'>' to match the contents of tags, we'll find it only matches the first character within the brackets.

This is because the shortest match directive forces the multi-character expression to match the shortest possible string. This may seem obvious, but sometimes, one specifies the shortest match for a search pattern elsewhere in the script, forgetting the effect it has on other patterns.

Matching Everything Until Something

To match everything (characters and line separators) between two delimiting expressions, it is OK to use [^]*, even though MSUB's author claims in the documentation that this expression is "dangerous".

For example:

!shortestmatch
<P> [^]* </P>

will match anything between a P tag, and its closing tag. Note that without the !shortestmatch directive, the expression would match everything between the first opening P tag and the last closing P tag in the entire file, including any opening and closing P tags in between.
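The longest-match behaviour is easy to see with grep, which always takes the longest match and has no equivalent of !shortestmatch in its standard syntax. The sketch below is only an analogy: it uses grep's extended syntax rather than MSUB's, assumes a grep with the -o option, and works on a single made-up line, since grep (unlike MSUB's [^]*) never matches across line separators. The negated set from the "Everything Except" section stands in for the shortest match.

```shell
text='<P>one</P><P>two</P>'

# Longest match: .* runs on to the LAST </P>, swallowing both paragraphs.
longest=$(printf '%s\n' "$text" | grep -o -E '<P>.*</P>')

# Stand-in for shortest match: [^<]* cannot cross the next <, so each
# paragraph is matched separately. This only works while the paragraph
# text itself contains no < characters.
shortest=$(printf '%s\n' "$text" | grep -o -E '<P>[^<]*</P>')

printf 'longest:\n%s\n' "$longest"
printf 'shortest:\n%s\n' "$shortest"
```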

Making Expressions More Readable

I keep forgetting that unquoted spaces in the search expression are ignored. One can place spaces between different parts of the search expression to make it more understandable.

Strange Things

MSUB claims that the characters =, <, and > are special and should be quoted, even though they don't seem to be mentioned in any regular expression documentation I've seen. This is because the author has reserved them for future developments.

MSUB claims that the dot expression should be quoted, even when I want it to behave as a special character.

Puzzling Output

When using the ask option, MSUB steps through the file one change at a time. At each change, it displays the text about to be changed, and what it would look like after the change. Four lines of the text before the change are shown, with the text to be changed on the third line. Only a single line of what the text would look like after the change is shown.

Problems occur where the Line Feed character (start-of-line, or ^ in RE terminology) occurs in the search-for string. This is often used to anchor the string to a certain position on a line. The problem is that I often forget to put the Line Feed back in the replace-with string.

Displayed lines in the before text of subsequent changes will overlap if they contain the results of a previous replace that dropped a Line Feed. This is very confusing if you've forgotten about this Line Feed problem!

Avoid dropping Line Feeds by:

  1. Placing the Line Feed in the replace-with string or,
  2. Using context markers (/) in the search-for string to exclude the Line Feed from the actual replacement.

MSUB Command-Line Quoting

Let's say we have some files that, in many places, contain the string "IMPORTANT POINT" (without the quotes) followed by some text, all on one line. Suppose we want to change the string to "PLEASE NOTE" on its own line, with the text that was on the same line moved to the line below.

The search pattern is Start-of-Line, followed by "IMPORTANT POINT". If we want to use a pattern containing a literal space, we must quote it twice. The outer quotes protect the pattern from the command-line interpreter, the inner quotes are for MSUB's benefit.

msub -text -search "^'IMPORTANT POINT'" -replace "^'PLEASE NOTE'$" filename.txt

The replace pattern is Start-of-Line, followed by "PLEASE NOTE", followed by End-of-Line. Thus, the old "IMPORTANT POINT" string is removed and replaced by "PLEASE NOTE" on its own line, with the following text (which was on the same line) now on the line below.

Gotcha 1: The pattern above finds the text correctly, but doesn't replace correctly. The problem is that the text that followed the "IMPORTANT POINT" has lost its Start-of-Line, since this was part of the search string. The replacement Start-of-Line only applies to the "PLEASE NOTE" line. So, to avoid this simple problem, we merely add an extra Start-of-Line to the end of the replace string, thus -replace "^'PLEASE NOTE'$^"

Gotcha 2: In the MSUB manual, the examples of command-line quoting show the whole search and replace strings being surrounded by both sets of quotes. Being in a rush, I followed the example blindly, and included some special characters inside both quotes, e.g. -search="'^IMPORTANT POINT'". This didn't work as expected since the ^ (Start-of-Line) character was being quoted, so MSUB thought I meant an ordinary caret symbol, rather than a Start-of-Line. Easily fixed: the inner quotes need to go only around text that needs quoting.
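When in doubt about what actually reaches MSUB, it helps to see the arguments after the shell has finished with them. The sketch below does not run msub at all; show_args is a hypothetical helper that prints each argument it receives in square brackets, exactly as any program would see it. It assumes a Bourne-style shell (sh or bash).

```shell
# Hypothetical helper: print each argument on its own line, bracketed.
show_args() {
    for a in "$@"; do
        printf '[%s]\n' "$a"
    done
}

# The shell strips the outer double quotes and passes the inner single
# quotes through untouched, so the ^ arrives OUTSIDE the inner quotes.
good=$(show_args -search "^'IMPORTANT POINT'" -replace "^'PLEASE NOTE'$^")

# Gotcha 2: here the ^ arrives INSIDE the inner quotes, so a program that
# treats single-quoted text as literal sees an ordinary caret.
bad=$(show_args -search "'^IMPORTANT POINT'")

printf '%s\n' "$good"
printf '%s\n' "$bad"
```

The whole pattern, spaces included, arrives as one argument in both cases; the only difference is which side of the inner quote the ^ ends up on.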

MSUB Limitation

MSUB can lose its place when storing context markers due to a legitimate design decision (speed at the expense of power). Context markers are used when you only want to replace a part of the full search expression. The context markers are placed either side of the part to be replaced. MSUB displays a warning message if there is a chance that the markers may not be honoured.

In the following example, the replace string worked fine, except that it retained the line feed, resulting in the weird output as described above.

^.+[^'.'] /$^/ .+$ 

The intention was to remove line-wrapping from a text document, for use in a word processor. The expression to the left of the first context marker specifies a general line of text that doesn't end in a full stop. The context markers tell MSUB to only replace the line separator (EOL SOL) between lines of text.

In this case, MSUB only removed the SOL (Line Feed) and left the EOL (Carriage Return), resulting in the output lines overlapping each other.


Copyright © Neil Carter

Content last updated: 2002-11-21