Vim Search and Replace HTML Tags

A common but troublesome task in maintaining web pages is changing certain tags. This page explains how this can be achieved using Vim's search and replace command ':s', along with a special search and replace feature that enables some of the text that matches the search expression to be included in the replacement text.

Let's say we have a glossary that is structured as a table of many rows but only two columns: the first column displays the glossary term, whilst the second column contains the term's definition. The aim is to convert the table structure to the HTML Definition List structure.

We need to replace the opening TD tag for the first column with a DT tag, and the closing TD tag with a closing DT tag (DT meaning Definition Term). Also, we need to replace the opening tag of the second column with the opening DD tag, and the closing TD tag with a closing DD tag (DD meaning Definition Description). The opening tags can be differentiated easily, using their width attribute. However, the challenge is in differentiating between the columns' closing tags, since they look identical.

The snippet below shows the HTML for a single glossary entry, which corresponds to a single table row. We can delete the TRs and other table tags manually, since there are few of them and they are easy to search for. It's the TD tags that must be changed automatically.

<TD width=50>Psychology</TD>
<TD width=300><P>The study of attitudes and behaviour</P></TD>

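The Vim command that performs the required task for the first column, which contains the definition terms, is given below. The command for the corresponding operation on the second column (containing the terms' definitions) is the same, apart from the width attribute (change the 50 to 300) and the replacement DT tags (replace them with DD). Note, however, that [^<]* cannot match across the nested <P> tags in the sample second-column cell; for cells like that, the captured part needs a looser pattern such as .* in place of [^<]*.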

:%s/TD\ width=50>\([^<]*\)<\/TD/DT>\1<\/DT/c

: enters Command-line mode, where commands such as :s are typed

% means all lines in file

s means substitute

/ starts the search expression

TD is the name of the column-opening tag. The < in front of it is not part of the search expression, so it is left in place.

\ (a backslash followed by a space) denotes an actual space. In a Vim search expression, a plain space also matches a space, so the backslash is not strictly needed here. It simply makes explicit that the space is part of the expression to be searched for, rather than a delimiter (separator) between options or expressions, which is how a space is normally used in computing.

width=50> is literal text that we are searching for.

\( starts the part of the expression that we want to capture and put back in the replace expression. The marker itself does not match any text.

[^<]* This is an exclusion set. That is, it represents any character that does NOT match the characters contained in the set. The set is defined by what appears between the square brackets. The first character within the brackets is the caret (^) symbol. This indicates that the set is exclusive rather than inclusive, so the text must not match any of the characters within the brackets (apart from the beginning caret). The * at the end means zero or more characters. So, this expression matches any number of characters that are not a < symbol. In other words, this part of the expression matches the text that appears between the HTML tags.

\) ends the part of the text that we want to store ready for replacement. In this case, the part of the text that we want to retain is the bit between the HTML tags. It is only the tags that are to be changed, not the text between them.

< is literal text that refers to the start of the closing tag.

\/ represents the closing tag marker. The / symbol cannot be used alone, since this symbol is used to delimit the search and replace expressions. If it wasn't 'escaped' with the \ symbol, Vim would think this was the end of the search expression.

TD is literal text. The closing tag's final > is not included in the search expression, so it is left untouched.

/ ends the search expression and starts the replace expression.

DT> replaces the matched TD opening tag, including its width attribute, with a DT opening tag.

\1 inserts into the replace text the text that was matched between the \( and \) markers.

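<\/ starts the closing tag: a literal < followed by the closing tag marker. The / symbol must be 'escaped' with the \ character, so that it is not interpreted as being the end of the replace expression.

DT replaces the TD of the closing tag, turning it into a DT closing tag. The final > was never part of the match, so the original one remains in place.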

/ ends the replace expression.

c means request confirmation of each replacement. Use this to verify the search-replace is working correctly for the first few times. When happy that it is working, just press a for all to allow all the remaining occurrences to be replaced automatically.
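
The behaviour described above can also be checked outside Vim. The following is a minimal Python sketch, not part of the original method: it mirrors the first-column command using Python's re module, where groups are written ( ) instead of \( \) and the / symbol needs no escaping. The function name convert_row and the second-column pattern are assumptions made for illustration. Because the sample definition cell contains nested <P> tags, which [^<]* cannot cross, the sketch uses a non-greedy .*? for that column (Vim's own non-greedy form is .\{-}).

```python
import re

# Mirror of the Vim search expression for the first column.
# Python writes the capture group as ( ... ) rather than \( ... \).
FIRST_COLUMN = re.compile(r"TD width=50>([^<]*)</TD")

# Assumed pattern for the second column: .*? is non-greedy, so it stops at
# the first </TD and can cross the nested <P> tags that [^<]* cannot.
SECOND_COLUMN = re.compile(r"TD width=300>(.*?)</TD")


def convert_row(html: str) -> str:
    """Turn the two TD cells of a glossary row into DT and DD elements."""
    # \1 re-inserts the captured text, exactly as in the Vim replace expression.
    html = FIRST_COLUMN.sub(r"DT>\1</DT", html)
    html = SECOND_COLUMN.sub(r"DD>\1</DD", html)
    return html


# The sample glossary entry from this page.
row = (
    "<TD width=50>Psychology</TD>\n"
    "<TD width=300><P>The study of attitudes and behaviour</P></TD>"
)

print(convert_row(row))
```

Run on the sample entry, this prints the term wrapped in DT tags and the definition wrapped in DD tags, with the <P> tags inside the definition left as they were.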

Copyright © Neil Carter

Content last updated: 2005-12-06