With help from Unix Hero Geraint Howells, MSUB author Anders Munch, and a variety of Usenet posters.
There are a variety of ways of applying commands, such as grep, to all the files in a directory, including sub-directories. They are based on the xargs and find commands. The find command is especially useful and is well worth memorising.
Let's say we are looking for all mentions of a variable named “counter” in source code files, which have names ending in .c and .h. (The examples below search only the .c files; repeat them with '*.h' to cover the headers.)
find ./ -name '*.c' | xargs grep counter
This is equivalent to:
grep counter `find ./ -name '*.c'`
The following will print the names of the files containing foo.
find ./ -name '*.c' | xargs grep -l foo
An alternative method is:
find ./ -name '*.c' -exec grep foo {} \;
The -exec method tends to be slower because grep is executed once per file found. If the command to be executed (in this case grep) can take multiple files as arguments, the xargs or backquote methods are faster. This is because grep is executed only once with the whole list of files found (xargs may split a very long list into a few batches), instead of once per file. The difference is most noticeable when recursing into a big directory structure.
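Two refinements are worth sketching here, both under assumptions about your tools. POSIX find accepts "-exec command {} +", which hands file names to the command in batches much as xargs does. Where find and xargs support the -print0 and -0 options (GNU and BSD versions do), file names containing spaces survive the pipe. The directory and file names below are invented for the demonstration.

```shell
# Build a throwaway tree to search (hypothetical file names).
demo=$(mktemp -d)
mkdir "$demo/sub dir"
printf 'int foo = 1;\n' > "$demo/a.c"
printf 'int foo = 2;\n' > "$demo/sub dir/b c.c"
printf 'int bar = 3;\n' > "$demo/sub dir/d.c"

# Batch mode: "+" instead of "\;" passes many file names to each grep run.
batched=$(find "$demo" -name '*.c' -exec grep -l foo {} + | sort)

# NUL-separated names (assumes -print0 and -0 are supported), so that
# "sub dir/b c.c" is not split at its spaces on the way to grep.
piped=$(find "$demo" -name '*.c' -print0 | xargs -0 grep -l foo | sort)

printf '%s\n' "$batched"
rm -rf "$demo"
```

Both forms should list the same two files; the plain "find | xargs" pipeline would instead break the name "b c.c" into pieces.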
The advantage of the -exec method is that find treats it as a test: if the command returns 0 (found), then any subsequent arguments are also evaluated. Using this fact, the following could be performed:
find ./ -name '*.c' -exec grep foo {} \; -print
In this case, if foo is found in a file, the name of the file is printed.
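Since grep in the command above still prints each matching line before find prints the name, a variation using grep's -q (quiet) option leaves only the file names. This is a sketch against an invented temporary directory, not part of the original tips.

```shell
# Two sample files: only one of them mentions foo.
demo=$(mktemp -d)
printf 'foo here\n' > "$demo/yes.c"
printf 'nothing\n'  > "$demo/no.c"

# -exec is true only when grep exits 0, so -print runs only for matches.
# -q suppresses grep's own output, leaving just the names from -print.
names=$(find "$demo" -name '*.c' -exec grep -q foo {} \; -print)

printf '%s\n' "$names"
rm -rf "$demo"
```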
Finally, GNU grep (version 2.3 and later) supports recursion with the -r, --recursive, or --directories=recurse options.
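To cover both the .c and .h files from the original problem in one pass, the find tests can be combined with -o (or), or a recursive grep can be restricted by file name. The second form assumes a grep that supports the --include option, as current GNU grep does; it may not exist in other greps. The file names are made up.

```shell
# Sample tree: two files should match, two should not.
demo=$(mktemp -d)
mkdir "$demo/inc"
printf 'int counter = 0;\n'    > "$demo/main.c"
printf 'extern int counter;\n' > "$demo/inc/util.h"
printf 'int other = 0;\n'      > "$demo/other.c"
printf 'counter notes\n'       > "$demo/notes.txt"

# find: the escaped parentheses group the two -name tests joined by -o.
with_find=$(find "$demo" \( -name '*.c' -o -name '*.h' \) -exec grep -l counter {} + | sort)

# Recursive grep limited to *.c and *.h (assumes --include is available).
with_grep=$(grep -rl --include='*.c' --include='*.h' counter "$demo" | sort)

printf '%s\n' "$with_find"
rm -rf "$demo"
```

The parentheses matter: without them, find applies the -exec action only to the second -name test.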
If matching everything up to a specific character, use a negated set, i.e. [^expression], where the expression is the character that terminates the wanted string. If the last character is to be retained, use context markers to exclude it, or simply add it to the replace string.
For instance, '<P'[^'>']+'>' will match any P tag, regardless of the content between the P and the closing >.
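The same negated-set idiom exists in ordinary regular expressions. As an illustration only, here it is in sed rather than MSUB (whose quoting rules differ); the sample text is invented, and * is used instead of + so that a bare <P> also matches.

```shell
sample='<P class="x">Hello</P> and <P>bye</P>'

# [^>]* matches everything up to, but not including, the next ">".
tagged=$(printf '%s\n' "$sample" | sed 's/<P[^>]*>/[P]/g')

printf '%s\n' "$tagged"
```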
When using the shortest match, remember that multi-character expressions may not work as expected. For instance, if we use '<'/[^'>']+/' to match the contents of tags, we'll find it only matches the first character within the brackets.
This is because the shortest match directive forces the multi-character expression to match the shortest possible string. This may seem obvious, but sometimes, one specifies the shortest match for a search pattern elsewhere in the script, forgetting the effect it has on other patterns.
To match everything (characters and line separators) between two delimiting expressions, it is OK to use [^]*, even though MSUB's author claims in the documentation that this expression is "dangerous".
For example:
!shortestmatch
<P> [^]* </P>
will match anything between a P tag and its closing tag. Note that without the !shortestmatch directive, the expression would match everything between the first opening P tag and the last closing P tag in the entire file, including any opening and closing P tags in between.
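The greedy behaviour is easy to reproduce with sed, used here purely as a stand-in since it has no shortest-match directive. The second command emulates a shortest match with a negated set, which only works when the paragraph text itself contains no "<". The sample line is invented.

```shell
sample='<P>one</P> mid <P>two</P>'

# Greedy: .* runs on to the LAST </P>, swallowing the text in between.
greedy=$(printf '%s\n' "$sample" | sed 's/<P>.*<\/P>/[X]/')

# Bounded: [^<]* cannot cross a "<", so each paragraph matches separately.
bounded=$(printf '%s\n' "$sample" | sed 's/<P>[^<]*<\/P>/[X]/g')

printf '%s\n%s\n' "$greedy" "$bounded"
```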
I keep forgetting that unquoted spaces in the search expression are ignored. One can place spaces between different parts of the search expression to make it more understandable.
MSUB claims that the characters =, <, and > are special and should be quoted, even though they don't seem to be mentioned in any regular expression documentation I've seen. This is because the author has reserved them for future developments.
MSUB claims that the dot expression should be quoted, even when I want it to behave as a special character.
When using the ask option, MSUB steps through the file one change at a time. At each change, it displays the text about to be changed, and what it would look like after the change. Four lines of the text before the change are shown, with the text to be changed on the third line. Only a single line of what the text would look like after the change is shown.
Problems occur where the Line Feed character (start-of-line, or ^ in RE terminology) occurs in the search-for string. This is often used to anchor the string to a certain position on a line. The problem is that I often forget to put the Line Feed back in the replace-with string.
Displayed lines in the before text of subsequent changes will overlap if they contain the results of a previous replace that dropped a Line Feed. This is very confusing if you've forgotten about this Line Feeds problem!
Avoid dropping Line Feeds by:
Let's say we have some files that, in many places, contain the string "IMPORTANT POINT" (without the quotes) followed by some text, all on one line. Suppose we want to change the string to "PLEASE NOTE" on its own line, with the text that was on the same line moved to the line below.
The search pattern is Start-of-Line, followed by "IMPORTANT POINT". If we want to use a pattern containing a literal space, we must quote it twice. The outer quotes protect the pattern from the command-line interpreter, the inner quotes are for MSUB's benefit.
msub -text -search "^'IMPORTANT POINT'" -replace "^'PLEASE NOTE'$" filename.txt
The replace pattern is Start-of-Line, followed by "PLEASE NOTE", followed by End-of-Line. Thus, the old "IMPORTANT POINT" string is removed and replaced by "PLEASE NOTE" on its own line, with the following text (which was on the same line) now on the line below.
Gotcha 1: The pattern above finds the text correctly, but doesn't replace correctly. The problem is that the text that followed the "IMPORTANT POINT" has lost its Start-of-Line, since this was part of the search string. The replacement Start-of-Line only applies to the "PLEASE NOTE" line. So, to avoid this simple problem, we merely add an additional Start-of-Line to the end of the replace string, thus: -replace "^'PLEASE NOTE'$^"
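For comparison, the intended end result can be sketched with sed instead of MSUB. This is only an analogy: sed's anchors are not replaceable markers, so the line break is written explicitly in the replacement as a backslash followed by a real newline. The sample line is invented.

```shell
# One input line: the marker followed by text on the same line.
input='IMPORTANT POINT check the oil'

# Replace the marker (and the spaces after it) with "PLEASE NOTE" plus a
# line break, pushing the rest of the text onto the next line.
result=$(printf '%s\n' "$input" | sed 's/^IMPORTANT POINT */PLEASE NOTE\
/')

printf '%s\n' "$result"
```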
Gotcha 2: In the MSUB manual, the examples of command-line quoting show the whole search and replace strings being surrounded by both sets of quotes. Being in a rush, I followed the example blindly, and included some special characters inside both quotes, e.g. -search="'^IMPORTANT POINT'". This didn't work as expected since the ^ (Start-of-Line) character was being quoted, so MSUB thought I meant an ordinary caret symbol, rather than a Start-of-Line. Easily fixed - the inner quotes need to go only around text that needs quoting.
MSUB can lose its place when storing context markers due to a legitimate design decision (speed at the expense of power). Context markers are used when you only want to replace a part of the full search expression. The context markers are placed either side of the part to be replaced. MSUB displays a warning message if there is a chance that the markers may not be honoured.
In the following example, the replace string worked fine, except that it retained the line feed, resulting in the weird output as described above.
^.+[^'.'] /$^/ .+$
The intention was to remove line-wrapping from a text document, for use in a word processor. The expression to the left of the first context marker specifies a general line of text that doesn't end in a full stop. The context markers tell MSUB to only replace the line separator (EOL SOL) between lines of text.
In this case, MSUB only removed the SOL (Line Feed) and left the EOL (Carriage Return), resulting in the output lines overlapping each other.
Copyright © Neil Carter
Content last updated: 2002-11-21