right sed Fred

Recently, I wanted to get a word count for an electronic copy of a book I had. Not the normal "number of words" word count, but the number of times each particular word was used. I wanted to know which words the author used most often. Then I wondered what the most common two-word combinations were. I couldn’t think of a good way of doing this short of writing a program, and that looked like a lot of work, so I let the whole project slide. Then I read about sed, and I realized that my goal was much closer and easier to reach than I had thought.

Sed is a "stream editor." In a nutshell, it transforms an input stream or file and sends the result to standard output. It can find lines containing certain words (like grep), replace one pattern with another, and replace characters with other characters (like tr). After a small learning curve, I was shocked at how easy sed was to use. There are only 24 commands. What I hadn’t understood before was the point of a "non-interactive stream editor": you can pipe input into it, let it manipulate the text, and then do whatever you like with the output.

My basic strategy was to make the whole thing uppercase, remove the punctuation, put a single space between every word on a line, and then replace the spaces with newlines. My sed script was the following:

y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
s/[[:cntrl:]]//g
s/[ \t";:,\.\?]/ /g
s/^[ ][ ]*//g
s/[ ][ ]*$//g
s/[ ][ ]*/\n/g
s/[\n][\n]*/\n/g
/^$/d

Let’s look at it line by line. On the first line, I want to make all of the letters uppercase. I did that because I want all of the "words" counted together, not one count for "Words" and another for "words." The y command replaces characters one for one: y/abc/xyz/ would replace every a, b, or c with the corresponding x, y, or z. Unlike a regular expression, y takes plain character lists, so square brackets are not needed (if present, they are simply mapped to themselves). One neat use of this command might be to make cryptograms or a simple substitution cipher by scrambling the letters in the second letter set.
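As a quick illustration of y, here is a minimal sketch. The function names upcase and rot13_lower are made up for this example, and the cipher shown (ROT13 on lowercase letters only) is just one possible scrambling of the second letter set, not something taken from the script above.

```shell
# Uppercase every lowercase letter, one character for one character.
upcase() {
  sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'
}

# A simple substitution cipher: shift each lowercase letter by 13 places.
# (Hypothetical example; any permutation of the alphabet would work.)
rot13_lower() {
  sed 'y/abcdefghijklmnopqrstuvwxyz/nopqrstuvwxyzabcdefghijklm/'
}

printf 'Words and words\n' | upcase        # prints: WORDS AND WORDS
printf 'hello world\n' | rot13_lower       # prints: uryyb jbeyq
```

Running rot13_lower twice gives back the original text, since shifting by 13 twice is a full trip around the alphabet.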

The second line introduces the s/pattern/replacement/ command. It substitutes the replacement for the pattern wherever it finds a match. The g at the end means that the replacement will occur as many times as the pattern appears on a line; without it, only the first occurrence is substituted. This line substitutes nothing for every control character in the input file. This gets rid of the \r that DOS likes to put at the end of lines. It also removes tabs, which are control characters too.
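To see the effect of the g flag and of the control-character class, a small sketch may help. The function names first_only, every_match, and strip_cntrl are invented for this illustration.

```shell
# Without g: only the first match on each line is replaced.
first_only() { sed 's/a/X/'; }

# With g: every match on each line is replaced.
every_match() { sed 's/a/X/g'; }

# Delete all control characters, including a DOS-style trailing \r.
strip_cntrl() { sed 's/[[:cntrl:]]//g'; }

printf 'banana\n' | first_only          # prints: bXnana
printf 'banana\n' | every_match         # prints: bXnXnX
printf 'dos line\r\n' | strip_cntrl     # prints: dos line
```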

The third line substitutes a space for each of the punctuation marks listed in the brackets. I left the apostrophe (') out of the list so that contractions would stay intact.

The fourth and fifth lines remove leading and trailing spaces. The first pattern matches one or more spaces at the start of a line, and replaces them with nothing. The second pattern matches one or more spaces followed by the end of a line, and replaces them with nothing. You need to know a little bit about regular expressions to use sed, so if you don’t know about *, [], ^, and $, you might have to read up.
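The two trimming substitutions can be tried on their own. This is a minimal sketch; the function name trim is an assumption for the example, and the square brackets printed around the result are only there to make the missing spaces visible.

```shell
# Remove leading spaces (anchored with ^), then trailing spaces (anchored with $).
trim() {
  sed -e 's/^[ ][ ]*//' -e 's/[ ][ ]*$//'
}

# Wrap the result in brackets so the trimmed edges are easy to see.
printf '[%s]\n' "$(printf '   hello world   \n' | trim)"   # prints: [hello world]
```

Because each pattern is anchored, it can match at most once per line, so the g flag makes no difference here.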

The sixth line substitutes one or more spaces with a newline character. As far as sed is concerned the original line is still one line, but it has newlines embedded in it.
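Here is a small sketch of that step in isolation. It assumes GNU sed, which accepts \n in the replacement part of s; other sed implementations may not. The function name split_words is made up for the example.

```shell
# Turn each run of one or more spaces into a single newline.
# Assumes GNU sed: \n in the replacement is not portable to every sed.
split_words() {
  sed 's/[ ][ ]*/\n/g'
}

printf 'THE QUICK  FOX\n' | split_words
# prints (with GNU sed):
# THE
# QUICK
# FOX
```

Where \n in the replacement is not supported, tr -s ' ' '\n' is a commonly used alternative for the same job.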

The seventh line replaces one or more newlines with a single newline. That was probably redundant, but I was having problems early on and I felt better leaving it in.

Finally, the last line uses the d command to delete all empty lines from the text.
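The d command is easy to try by itself. In this sketch the function name drop_empty is an assumption for illustration only.

```shell
# Delete every line that matches ^$, i.e. every line with nothing in it.
drop_empty() {
  sed '/^$/d'
}

printf 'A\n\nB\n\n\nC\n' | drop_empty
# prints:
# A
# B
# C
```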

I would run this as

sed -f words.sed myfile.txt | sort | uniq -c | sort -nr

This sorts my word list, combines identical lines and counts them, and then sorts the results numerically by count, highest first. This part of what I wanted to do (the single word count) I had seen in an example. The real fun was changing it to put a word and the word after it on the same line, separated by a space.
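The counting stage of that pipeline can be tried without any sed at all, on a hand-made list with one word per line. This is only a sketch: the function name count_words and the sample words are invented, and the exact padding that uniq -c puts in front of each count varies between systems.

```shell
# sort groups identical words together, uniq -c collapses each group
# into one line prefixed by its count, and sort -nr orders by count,
# largest first.
count_words() {
  sort | uniq -c | sort -nr
}

# THE appears three times, CAT twice, DOG once, so the output lists
# them in that order, each preceded by its count.
printf 'THE\nCAT\nTHE\nDOG\nTHE\nCAT\n' | count_words
```

The first sort matters because uniq only merges lines that are next to each other.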