Regular expression syntax has two line-break expressions: the start-of-line (SOL) character '^' and the end-of-line (EOL) character '$'.
In DOS, a line-break consists of a carriage-return (CR) character (ASCII code Hex 0D or Dec 13) followed by a line-feed (LF) character (ASCII code Hex 0A or Dec 10). In Unix, line-breaks consist of only the LF character.
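As a small illustration of the byte values given above, the following Python sketch spells out the two line-break conventions and converts one to the other. The names `DOS_BREAK`, `UNIX_BREAK` and `to_unix` are made up for this example.

```python
# DOS and Unix line-breaks written out as raw bytes.
DOS_BREAK = b"\r\n"   # CR (hex 0D, dec 13) followed by LF (hex 0A, dec 10)
UNIX_BREAK = b"\n"    # LF only


def to_unix(data: bytes) -> bytes:
    """Convert DOS line-breaks (CR LF) to Unix line-breaks (LF)."""
    return data.replace(DOS_BREAK, UNIX_BREAK)


print(list(DOS_BREAK))              # [13, 10]
print(to_unix(b"one\r\ntwo\r\n"))   # b'one\ntwo\n'
```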
Performing searches across line-breaks can be tricky with many RE programs as they usually work on a line-by-line basis.
Word processors and text editors treat lines and paragraphs differently. In a text file, lines are separated by hard-returns (typically the standard line-break of the operating system). Text editors have no concept of a paragraph, so paragraphs are delimited for human readability by other means, usually an indent on the first line of each paragraph or a blank line after each paragraph.
Word processors treat paragraphs as one long line, and perform word-wrapping dynamically to accommodate varying output widths. Effectively, in word processors, each paragraph is stored as a single line.
Word processors are not always clever enough to merge the multiple lines of one paragraph in a text-file into a single paragraph. This means that the word processor cannot properly wrap the lines within a paragraph. What is needed is a means of concatenating multiple lines from the same paragraph into one line. It is possible to do this using a search & replace process to replace line-breaks with single spaces (to prevent the gap between words being lost).
However, specifying a paragraph using REs is not easy. For most cases, a paragraph can be defined as being a series of non-blank lines terminated by either a blank line, or a line starting with more than one blank. Sometimes, lines may be wrapped so that the space between words in the same paragraph appears at the start of a new line. In this case, we don't want to start a new paragraph. We should only start a new paragraph if there is more than one space, as it is unlikely that a series of spaces would appear within a paragraph.
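To make the paragraph definition above concrete, here is a minimal Python sketch that applies it line by line, without regular expressions. The function name `unwrap_paragraphs` is invented for this example, and the rules are only the ones stated above: a blank line ends a paragraph, a line starting with more than one space begins a new one, and a single leading space is treated as a wrapped word gap.

```python
def unwrap_paragraphs(text: str) -> list[str]:
    """Join the lines of each paragraph into one long line."""
    paragraphs = []
    current = []  # lines collected for the paragraph in progress

    for line in text.splitlines():
        if not line.strip():
            # A blank (or whitespace-only) line ends the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
        elif line.startswith("  "):
            # More than one leading space: treat as the start of a new paragraph.
            if current:
                paragraphs.append(" ".join(current))
            current = [line.strip()]
        else:
            # Ordinary line, or one leading space left over from wrapping.
            current.append(line.strip())

    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = "  First line\nsecond line\n wrapped gap\n\nNext para\ncontinues\n  Indented para\n"
for paragraph in unwrap_paragraphs(sample):
    print(paragraph)
```

This is not the search & replace approach developed below, only a reference for what the regular expressions are trying to achieve.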
Below I describe the operation of each search&replace pair. Each one was tested separately, and I have explained the failure of those that were wrong. The search-expressions handled paragraph breaks (blank lines) correctly, and the bugs only applied to concatenating the lines within a paragraph.
^`.+`$
`' '
Search for an SOL, followed by any chars, followed by an EOL. We only need to find one char on the line to call it a non-blank line. Retain the contents of the line, but delete the SOL and EOL and replace them with a space.
The problems with the above RE are that it removes the SOL from the beginning of the file and the EOL from the end of the file, leaves a trailing space at the end of each line (paragraph), and doesn't start a new paragraph for lines starting with an indent.
[^\26]^`.+`$
`' '
ASCII code Dec 26 (the SUB char) is DOS's end-of-file (EOF) character, but also appears as the first character in a file, so can also be considered to be a start-of-file (SOF) character.
To prevent the SOL disappearing from the first line in the file, I added a 'not-SOF' to the beginning of the search RE. This means it won't match the first line of the file because the preceding character of the first line is an SOF character. However, it still doesn't work because it only matches every other line and drops all the EOL characters. It matches every other line because of the way the not-SOF matches characters.
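The every-other-line behaviour can be reproduced in Python, with some loud assumptions. Python's `^` and `$` are zero-width, so this sketch instead follows the model used in this article: the SOL is the LF (0Ah), the EOL is the CR (0Dh), and the file starts with an SOF char (ASCII 26). The helper `model_file` and the names `SECOND_RE` and `apply_second` are hypothetical, and the pattern is a translation of the search expression, not the original tool's syntax.

```python
import re

SOF = "\x1a"  # ASCII 26 (SUB), treated here as the start-of-file char


def model_file(lines):
    """Lay out lines as: SOF, then SOL (LF) + content + EOL (CR) per line."""
    return SOF + "".join("\n" + line + "\r" for line in lines)


# not-SOF, SOL, tagged non-empty content, EOL
SECOND_RE = re.compile(r"[^\x1a]\n([^\r\n]+)\r")


def apply_second(text):
    """Replace each match with the tagged content followed by a space."""
    return SECOND_RE.sub(r"\1 ", text)


result = apply_second(model_file(["A", "B", "C", "D"]))
print(repr(result))  # '\x1a\nAB \nCD '
```

Under these assumptions only lines B and D are matched. The not-SOF consumes the CR of the line before each match, so the lines end up joined in pairs and the only remaining breaks are bare LFs.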
Below is a step-by-step description of the search & replace operation:
The first line is ignored because it has an SOF char before the SOL.
The second line is matched because the char before its SOL is not an SOF; it is the EOL of the first line. We should not replace this char, as it is only the pre-context, but it is removed along with the rest of the match.
The second line has its SOL (LF or 0A) and EOL (CR or 0D) removed, and now terminates in a space.
The third line is ignored.
The search & replace process now operates in a cyclic manner within a paragraph. It skips every other line because the match consumes the EOL of the matched line, so the next available char is the SOL of the following line. That SOL is matched by the not-SOF, so the SOL in the search expression coincides with a normal char in the file and the match fails.
When the replace function removes the EOL from the second line of each cycle, it doesn't replace it with anything, so we are left with a file full of concatenated double lines, with an invalid (as far as DOS is concerned) line-break consisting of an SOL, but no EOL (0Ah, without the preceding 0Dh).
[^\26]/^`.+`$/
`' '
The only difference between this RE and the previous one is that this one uses the '/' context markers to avoid replacing the char matched by the 'not-SOF'.
Again, the first line of the file doesn't match the search expression because of the not-SOF.
The operation of this expression is similar to the last, except that the EOL of each unmatched line is retained, so these lines are kept in full. The next line, however, loses its SOL and EOL, so we end up with an alternation of invalid line-breaks.
The first line is not matched (SOF, SOL, content, EOL), so it is left unchanged and the search moves to the next line. The process then follows a cycle: one line is matched and loses its SOL and EOL, and the line after it is skipped and retained in full.
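The alternation described above can be reproduced in Python under the same assumptions as the article's model: the SOL is the LF, the EOL is the CR, and the file starts with an SOF char (ASCII 26). A further assumption here is that the tool's pre-context char is consumed by the match and so cannot be reused by the next one. That is emulated by capturing the char and writing it back, since a Python lookbehind would behave differently. The names `model_file`, `THIRD_RE` and `apply_third` are hypothetical.

```python
import re

SOF = "\x1a"  # ASCII 26 (SUB), treated here as the start-of-file char


def model_file(lines):
    """Lay out lines as: SOF, then SOL (LF) + content + EOL (CR) per line."""
    return SOF + "".join("\n" + line + "\r" for line in lines)


# Group 1 stands in for the pre-context (not-SOF): it is consumed by the
# match but written back unchanged. Group 2 is the tagged line content.
THIRD_RE = re.compile(r"([^\x1a])\n([^\r\n]+)\r")


def apply_third(text):
    """Keep the pre-context char, then the content followed by a space."""
    return THIRD_RE.sub(r"\1\2 ", text)


result = apply_third(model_file(["A", "B", "C", "D"]))
print(repr(result))  # '\x1a\nA\rB \nC\rD '
```

In this model lines A and C keep their CR, while B and D lose both their LF and their CR, leaving breaks that alternate between a bare CR and a bare LF.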
[^^]/$^`.`/
' '`
Search for a not-SOL, an EOL, an SOL, and any char. Replace with a space and the matched character. This effectively replaces line-breaks between two non-blank lines with spaces.
The change in thinking for this expression is in concentrating on the line-break between the previous line and the current line, and ignoring the EOL of the current line. This expression just looks for the line break between two non-blank lines.
The first component is a not-SOL (this time I'm avoiding start-of-LINEs as opposed to the start-of-FILEs in the previous expressions), and the second is a line-break (two expressions: an EOL followed by an SOL). These three expressions combined (a product expression) ensure that the line-break is preceded by a non-blank line. After the line-break, I specify any non-line-break character, thereby ensuring that the following line is also non-empty (but it may contain only blanks).
The problem with the otherwise successful RE above is that it concatenates lines which start with spaces, even though, by the definition given earlier, a line starting with more than one space should begin a new paragraph.
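For comparison, here is a rough Python equivalent of this last expression, working on ordinary DOS text with CR LF line-breaks. It is a translation rather than the original tool's syntax: Python lookarounds play the part of the context, and the names `JOIN_RE` and `join_lines` are invented for the example.

```python
import re

# A CR LF line-break that is preceded by a non-break char (so the previous
# line is non-empty) and followed by a non-break char (so the next line is
# non-empty, although it may contain only blanks).
JOIN_RE = re.compile(r"(?<=[^\r\n])\r\n(?=[^\r\n])")


def join_lines(text):
    """Replace line-breaks between two non-empty lines with a single space."""
    return JOIN_RE.sub(" ", text)


sample = "one\r\ntwo\r\n\r\nthree\r\n  four\r\n"
print(repr(join_lines(sample)))  # 'one two\r\n\r\nthree   four\r\n'
```

The blank line survives as a paragraph break, but the indented line "  four" is still joined to "three", which is the shortcoming noted above.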
This expression could be enhanced by ensuring that the line following the line-break does contain non-whitespace characters.
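One possible form of that enhancement, sketched in Python rather than in the original tool's syntax: the line-break is only replaced if the following line does not start with more than one space and contains at least one non-whitespace character. The names `SMART_JOIN_RE` and `smart_join` are hypothetical, and the sketch assumes CR LF line-breaks.

```python
import re

SMART_JOIN_RE = re.compile(
    r"(?<=[^\r\n])"     # previous line is non-empty
    r"\r\n"             # the line-break to replace
    r"(?! {2})"         # next line must not start with two or more spaces
    r"(?=[^\r\n]*\S)"   # next line must contain a non-whitespace char
)


def smart_join(text):
    """Join wrapped lines, leaving indented and blank-looking lines alone."""
    return SMART_JOIN_RE.sub(" ", text)


sample = "one\r\ntwo\r\n  three\r\nfour\r\n \r\n"
print(repr(smart_join(sample)))  # 'one two\r\n  three four\r\n \r\n'
```

The sketch only checks the line after the break, as suggested above. A whitespace-only line followed by text would still be joined to that text, so a fuller solution would need to check the preceding line as well.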
Copyright © Neil Carter
Content last updated: 2000-07-02