JP van Oosten

Improve your writing with shell scripts

Jul 26, 2010

Currently, I’m working on my Master’s thesis on Hidden Markov Models. Matt Might wrote an article on three shell scripts to improve your writing, which I found interesting. The scripts help to detect the use of passive voice, weasel words (such as “surprisingly low”) and duplicate words (which are difficult to detect when a line break separates them).
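To illustrate why duplicate words are hard to spot across a line break, here is a minimal Python sketch of such a check. It is not one of Matt Might's scripts; the names `DUPLICATE_RE` and `find_duplicate_words` are made up for this example, and the regular expression is only one plausible way to do it.

```python
import re

# Hypothetical helper, not taken from Matt Might's article.
# \s+ matches any whitespace, including a newline, so a word repeated
# across a line break is found just like one repeated on the same line.
DUPLICATE_RE = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)

def find_duplicate_words(text):
    """Return (line number, word) pairs for immediately repeated words."""
    results = []
    for match in DUPLICATE_RE.finditer(text):
        # Count the newlines before the match to get a 1-based line number.
        line_number = text.count('\n', 0, match.start()) + 1
        results.append((line_number, match.group(1)))
    return results

sample = "This is the the first line, and\nand this is the second."
print(find_duplicate_words(sample))  # [(1, 'the'), (1, 'and')]
```

The second hit is the kind that is easy to miss by eye: the first "and" ends one line and the second starts the next.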

One of the remarks I repeatedly received was that my paragraphs were much too short. A couple of paragraphs were just one or two sentences, which I could usually merge into a neighbouring paragraph.

In the spirit of Matt’s scripts, I wrote my own, which detects paragraphs of at most two sentences that span fewer than four lines. It isn’t perfect, and not every short paragraph needs to be longer, but the paragraphs it reports may warrant a closer look.

#!/usr/bin/env python
import sys
import re

SINGLE_COMMAND_RE = re.compile(r'^\\\w+\{[^}]+\}$')

def process(file, filename):
    """Reports short paragraphs (pieces of text surrounded by blank lines).
    Ignores lines containing only a single command at the beginning of a
    paragraph."""
    paragraph = [None, 0, 0] # [start, sentence count, linecount]
    prev_line = None
    for linenum, line in enumerate(file):
        line = line.strip()
        if SINGLE_COMMAND_RE.match(line) and not prev_line:
            # A lone command such as \section{Results} opening a paragraph.
            continue
        if not line and paragraph[1]:
            # A blank line ends the paragraph.
            report_short_paragraph(filename, paragraph)
            paragraph = [None, 0, 0]
        elif line:
            if paragraph[0] is None:
                paragraph = [linenum, 0, 0]
            # Count periods as sentence ends; an ellipsis counts once.
            paragraph[1] += line.count('.')
            paragraph[1] -= 2 * line.count('...')
            paragraph[2] = linenum - paragraph[0] + 1
        prev_line = line
    if paragraph[1]:
        # The file may end without a trailing blank line.
        report_short_paragraph(filename, paragraph)

def report_short_paragraph(filename, paragraph):
    if paragraph[1] <= 2 and paragraph[2] < 4:
        print('%s:%d paragraph of %d sentence(s) / %d lines' % (filename,
                paragraph[0] + 1, paragraph[1], paragraph[2]))

if __name__ == "__main__":
    for filename in sys.argv[1:]:
        with open(filename) as file:
            process(file, filename)
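
The script counts sentences by counting periods, which is where most of the imperfection comes from. The sketch below isolates that heuristic so it can be tried on a few lines; the function name `count_sentences` is made up for this example and does not appear in the script itself.

```python
def count_sentences(line):
    """Approximate the number of sentences on a line by counting periods.

    An ellipsis ('...') contains three periods, so two are subtracted to
    make it count as a single sentence end."""
    return line.count('.') - 2 * line.count('...')

print(count_sentences("One sentence. Another one."))         # 2
print(count_sentences("Trailing off... Then a new start."))  # 2
print(count_sentences("See e.g. Fig. 3.2 for details."))     # 5
```

The last line shows the weak spot: abbreviations and decimal numbers inflate the count, so a short paragraph full of them can slip past the check.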