MSUB Script to Convert Lines into Paragraphs

Explanation

Line Breaks

Regular expression syntax has two line-break expressions: the start-of-line (SOL) character '^' and the end-of-line (EOL) character '$'.

In DOS, a line-break consists of a carriage-return (CR) character (ASCII code Hex 0D or Dec 13) followed by a line-feed (LF) character (ASCII code Hex 0A or Dec 10). In Unix, line-breaks consist of only the LF character.
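
As a small illustration of the two conventions, the Python sketch below spells out the byte values and converts between them. The function names to_unix and to_dos are hypothetical, chosen only for this example.

```python
CR = b"\x0d"  # carriage-return, Dec 13
LF = b"\x0a"  # line-feed, Dec 10
DOS_BREAK = CR + LF  # DOS line-break: CR followed by LF


def to_unix(data: bytes) -> bytes:
    """Replace every DOS line-break (CR LF) with a bare LF."""
    return data.replace(DOS_BREAK, LF)


def to_dos(data: bytes) -> bytes:
    """Replace every line-break with CR LF, normalising to LF first
    so that existing CR LF pairs are not doubled up."""
    return to_unix(data).replace(LF, DOS_BREAK)
```

Working on bytes rather than text keeps Python's own newline translation out of the picture.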

Performing searches across line-breaks can be tricky with many RE programs as they usually work on a line-by-line basis.

Paragraphs v. Lines

Word processors and text editors treat lines and paragraphs differently. In a text file, lines are separated by hard-returns (typically the standard line-break of the operating system). Text editors have no concept of a paragraph, so paragraphs are delimited for human readability by other means, usually an indent on the first line of each paragraph or a blank line after each paragraph.

Word processors treat each paragraph as one long line, and perform word-wrapping dynamically to accommodate varying output widths. Effectively, in word processors, each paragraph is stored as a single line.

Word processors are not always clever enough to merge the multiple lines of one paragraph in a text-file into a single paragraph. This means that the word processor cannot properly wrap the lines within a paragraph. What is needed is a means of concatenating the multiple lines of one paragraph into a single line. It is possible to do this using a search&replace process that replaces line-breaks with single spaces (so that the gap between words is not lost).

However, specifying a paragraph using REs is not easy. For most cases, a paragraph can be defined as being a series of non-blank lines terminated by either a blank line, or a line starting with more than one blank. Sometimes, lines may be wrapped so that the space between words in the same paragraph appears at the start of a new line. In this case, we don't want to start a new paragraph. We should only start a new paragraph if there is more than one space, as it is unlikely that a series of spaces would appear within a paragraph.
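
Outside of REs, the paragraph definition above can be sketched procedurally. The Python below is a minimal illustration and not part of the MSUB script. It assumes that a blank line ends a paragraph, that more than one leading space starts a new one, and that a single leading space is just a wrapped inter-word gap. The name lines_to_paragraphs is hypothetical.

```python
from typing import List


def lines_to_paragraphs(text: str) -> List[str]:
    """Join the lines of each paragraph into one long line."""
    paragraphs: List[str] = []
    current: List[str] = []

    def flush() -> None:
        # Emit the paragraph collected so far, if any.
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for line in text.splitlines():
        if not line.strip():
            # A blank line terminates the current paragraph.
            flush()
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent > 1:
            # More than one leading space: start a new paragraph.
            flush()
        # A single leading space is treated as a wrapped word gap.
        current.append(line.strip())
    flush()
    return paragraphs
```

For example, the text "First line\nsecond line\n\n  Indented start\n continues here\nlast\n" yields two paragraphs: "First line second line" and "Indented start continues here last".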

The Regular Expressions

Below I describe the operation of each search&replace pair. Each one was tested separately, and I have explained the failure of those that were wrong. The search-expressions handled paragraph breaks (blank lines) correctly, and the bugs only applied to concatenating the lines within a paragraph.


RE 1


^`.+`$
`' '

Search for an SOL, followed by any chars, followed by EOL. We only need to find one char on the line to call it a non-blank line. Retain the contents of the line, but delete the SOL and EOL and replace them with a space.

The problems with the above RE are that it removes the SOL from the beginning of the file and the EOL from the end of the file, leaves a trailing space at the end of each line (paragraph), and doesn't start a new paragraph for lines starting with an indent.
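
Two of these symptoms (the trailing space and the ignored indent) can be reproduced with a rough analogue in Python's re module. This is only a sketch: Python's RE syntax and matching rules are not MSUB's, so the start-of-file and end-of-file behaviour described above is not modelled, and the name join_all_nonblank is hypothetical.

```python
import re


def join_all_nonblank(text: str) -> str:
    """Replace each non-blank line and its line-break with the
    line's content followed by a single space."""
    # [^\r\n]+ is used instead of .+ because Python's '.' matches CR.
    return re.sub(r"^([^\r\n]+)\r?\n", r"\1 ", text, flags=re.MULTILINE)


result = join_all_nonblank("one\ntwo\n  three\nfour\n\nfive\n")
print(repr(result))  # 'one two   three four \nfive '
```

The indented line "  three" is merged into the first paragraph, and each joined paragraph ends in a trailing space.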


RE 2


[^\26]^`.+`$
`' '

ASCII code Dec 26 (the SUB char) is DOS's end-of-file (EOF) character, but also appears as the first character in a file, so can also be considered to be a start-of-file (SOF) character.

To prevent the SOL disappearing from the first line in the file, I added a 'not-SOF' to the beginning of the search RE. This means it won't match the first line of the file because the preceding character of the first line is an SOF character. However, it still doesn't work because it only matches every other line and drops all the EOL characters. It matches every other line because of the way the not-SOF matches characters.

Below is a step-by-step description of the search & replace operation.

  1. The first line isn't matched because it fails the not-SOF component.
  2. The first match made is the second line. The not-SOF component matches the EOL of the first line, whilst the whole of the second line (SOL, <content>, EOL) matches the rest of the search expression.
  3. The line-break between the first and second lines is removed.
  4. The content of the second line is added onto the end of the first line, with no separating characters (this is a bug - a space should separate them).
  5. The EOL of the second line is replaced by a trailing space.
  6. The first (combined) line of the result file now reads: SOL, content (1st line + 2nd line), space. Note that the second line has lost its EOL. This may not be a problem, as we may not have finished this paragraph yet.

The search & replace process now operates cyclically within a paragraph:

  1. The RE cannot match the next line in the file because the remaining text starts with the SOL of the next line.
  2. This SOL is matched by the not-SOF in the search expression, so the following SOL in the search expression can't match the next character of the line (we'd need two SOLs in a row to match the first two expressions of this RE).
  3. The search continues failing each character until the line-break between the current and next lines. The EOL of the current line is matched by the not-SOF, and the SOL of the next line matches the SOL of the search expression.
  4. The line break between the current and next line is deleted, and the next line is added onto the end of the current line with nothing separating them (that bug again!).
  5. A space is added at the end of this new combined line.
  6. The process repeats the above steps in a cyclic nature (matching a line, then failing a line within a single paragraph) until the end of the file.

When the replace function removes the EOL from the second line of each cycle, it doesn't replace it with anything, so we are left with a file full of concatenated double lines, with an invalid (as far as DOS is concerned) line-break consisting of an SOL, but no EOL (0Ah, without the preceding 0Dh).
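
Alternate-line skipping of this kind is not peculiar to MSUB. Any engine that returns non-overlapping matches shows a similar effect when the pattern consumes context characters on both sides of a line. The Python sketch below is an analogy only. It uses Python's re module, with a plain "\n" standing in for the line-break, and does not reproduce MSUB's SOL/EOL handling.

```python
import re

text = "a\nb\nc\nd\ne\n"

# The pattern consumes the line-break before AND after each line, so the
# break that ends one match is no longer available to start the next.
consuming = re.findall(r"\n(.+)\n", text)

# A lookahead checks the trailing line-break without consuming it,
# so every line after the first can be matched in turn.
non_consuming = re.findall(r"\n(.+)(?=\n)", text)

print(consuming)      # ['b', 'd']
print(non_consuming)  # ['b', 'c', 'd', 'e']
```

The first pattern matches only every other line, because each match uses up the context that the following line would need.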


RE 3


[^\26]/^`.+`$/
`' '

The first line is ignored because it has an EOF char before the SOL. The second line is matched because the char before its SOL is not an EOF; it is an EOL instead, but we should not replace this as it is the pre-context. The second line has its SOL (LF or 0A) and EOL (CR or 0D) removed, and terminates in a space. The third line is ignored.

It skips every other line because the line after a matched line loses its EOL, so the next available char is the SOL. That SOL is matched by the not-EOF, so the SOL in the search expression fails because it coincides with a normal char in the file.

The only difference between this RE and the previous is that this one uses the '/' context markers to avoid replacing the char matched by the 'not-EOF'.

Again, the first line of the file doesn't match the search expression because of the not-SOF.

The operation of this expression is similar to the last, except that the EOL of each unmatched line is retained, so these lines are kept intact. The next line, however, loses its SOL and EOL, so we end up with an alternation of invalid line-breaks.

The first line is not matched (SOF, SOL, content, EOL), so it is left unchanged and the search moves to the next line. The process then follows the cycle below:

  1. EOL of last line, SOL of current line, content of current line, EOL of current line matched.
  2. Replaced with: EOL of last line, content of current line, space.
  3. Next line (SOL, content, EOL) not matched, so leave unchanged in result file, and move to next line.
  4. Repeat cycle.

RE 4


[^^]/$^`.`/
' '`

Search for not-SOL, EOL, SOL, any char. Replace with a space and the matched character. This effectively replaces line-breaks between two non-blank lines with spaces.

The change in thinking for this expression is in concentrating on the line-break between the previous line and the current line, and ignoring the EOL of the current line. This expression just looks for the line break between two non-blank lines.

The first component is a not-SOL (this time I'm avoiding start-of-LINEs, as opposed to the start-of-FILEs of the previous expressions), and the second is a line-break (two expressions: an EOL followed by an SOL). These three expressions combined (a product expression) ensure that the line-break is preceded by a non-blank line. After the line-break, I specify any non-line-break character, thereby ensuring that the following line is also non-empty (but it may contain only blanks).

The problem with the otherwise successful RE above is that it concatenates lines which start with spaces, even though one definition of a paragraph states that they start with an indentation (more than one space).

This expression could be enhanced by ensuring that the line following the line-break does contain non-whitespace characters.
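
The idea behind RE 4, and one possible form of the enhancement, can be sketched in Python's re syntax. This is an assumption-laden analogue and not MSUB code. Lookbehind and lookahead stand in for MSUB's context markers, and the enhanced pattern also refuses to join when the next line starts with two or more spaces, which addresses the indentation problem noted above. The names RE4_LIKE, ENHANCED, join_re4_like and join_enhanced are hypothetical.

```python
import re

# Analogue of RE 4: a line-break preceded by a non-line-break character
# (the previous line is non-empty) and followed by any character on the
# next line (the next line is non-empty, though it may be only blanks).
RE4_LIKE = re.compile(r"(?<=[^\r\n])\r?\n(?=[^\r\n])")

# Enhanced variant: the next line must not start with two or more spaces,
# and it must contain at least one non-whitespace character.
ENHANCED = re.compile(r"(?<=[^\r\n])\r?\n(?! {2})(?=[ \t]*\S)")


def join_re4_like(text: str) -> str:
    """Replace line-breaks between two non-empty lines with a space."""
    return RE4_LIKE.sub(" ", text)


def join_enhanced(text: str) -> str:
    """As join_re4_like, but keep the break before an indented line."""
    return ENHANCED.sub(" ", text)


sample = "one\ntwo\n\nthree\n  four\nfive\n"
print(repr(join_re4_like(sample)))  # 'one two\n\nthree   four five\n'
print(repr(join_enhanced(sample)))  # 'one two\n\nthree\n  four five\n'
```

Because only the line-break itself is consumed, no line is skipped and the final line-break of the file is left in place.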


Copyright © Neil Carter

Content last updated: 2000-07-02