Ryan Kavanagh's Blog

From: Ryan Kavanagh rak@debian.org  Wed Feb 20 08:52 2013
To: Blog
Subject: Search RCS and CVS ",v" files with rcsgrep
Date: Wed, 20 Feb 2013 08:52:09 (EST)
X-Categories:  planet-ubuntu


A few years ago I was doing research comparing how large software distributions handled shared object libraries, and studied Debian, FreeBSD, and Ubuntu. Extracting data about Debian packages was easy thanks to Peter Palfrader’s snapshot.debian.org service, which provides a machine-usable interface to Debian’s package history. FreeBSD’s data is equally accessible, albeit in a less pleasant format: their ports tree was stored in CVS until July 2012. One could easily rsync a copy of the ports tree’s CVS repository to a local machine to analyze the data. This left you with a local tree full of ,v files, each holding the history of the file at that location. I needed to extract all kinds of data from a tree full of these files, such as which revisions contained lines matching a regex, when these revisions were checked in, any tags associated with them, etc. It also helped to know the line numbers of the matching lines. Hence the birth of rcsgrep.

rcsgrep is a Python script that makes use of Paul McGuire’s fabulous pyparsing library. It allows you to search an RCS file (the ,v file format used by RCS and CVS to store revision history) using a Python regex. The output format is customizable, so you can print only certain kinds of information: the revision number, the line number, the matching line, the line’s author, the date it appeared, any tags associated with the line, and (useful when running over a large number of files) the file name. To make machine parsing (using AWK of course) easier, you can also specify the column separator.

For example, I entered the lines “The quick brown”, “jumped over the”, and “lazy dog. Woof!” into the file abc, checking in the changes after each line. The invocation ./rcsgrep -s ' ' -f rlLda '.*' abc,v, with spaces for column separation and the format options r for revision, l for line number, L for line contents, d for date, and a for author, outputs:

1.3    1    The quick brown    2013.02.20.14.24.09    ryan
1.3    2    jumped over the    2013.02.20.14.24.09    ryan
1.3    3    lazy dog. Woof!    2013.02.20.14.24.09    ryan
1.2    1    The quick brown    2013.02.20.14.23.48    ryan
1.2    2    jumped over the    2013.02.20.14.23.48    ryan
1.1    1    The quick brown    2013.02.20.14.23.25    ryan
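
These columns are just as easy to consume from Python as from AWK. The following is a minimal sketch and not part of rcsgrep: it assumes the output was produced with a tab as the column separator (so that spaces inside the matching line stay unambiguous) and with the format string rlLda. The names Match and parse_row are made up for illustration.

```python
from datetime import datetime
from typing import NamedTuple


class Match(NamedTuple):
    """One row of output for the format string rlLda (hypothetical type)."""
    revision: str
    line_number: int
    line: str
    date: datetime
    author: str


def parse_row(row: str, sep: str = "\t") -> Match:
    """Split one output row (fields r, l, L, d, a) into a Match."""
    # Split from both ends so that a separator occurring inside the
    # matched line itself does not shift the remaining columns.
    revision, line_number, rest = row.rstrip("\n").split(sep, 2)
    line, date, author = rest.rsplit(sep, 2)
    return Match(
        revision,
        int(line_number),
        line,
        # Dates in the output above look like 2013.02.20.14.24.09.
        datetime.strptime(date, "%Y.%m.%d.%H.%M.%S"),
        author,
    )


row = "1.3\t2\tjumped over the\t2013.02.20.14.24.09\tryan"
match = parse_row(row)
print(match.revision, match.line_number, match.date.isoformat())
```

Splitting from both ends is a deliberate choice here: the matched line is the only free-form column, so it is the only one that might contain the separator.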

I’m particularly proud of my grep() function in rcsfile.py, which goes through each revision, tracking additions and deletions, but only keeping the lines matching the regex in memory. In any case, rcsgrep is licensed under the ISC license and can be found on GitHub.
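
To illustrate what tracking additions and deletions involves: an RCS file stores the newest trunk revision in full, and each older trunk revision as an edit script relative to its successor. The script consists of the commands aN M (add the following M lines after line N) and dN M (delete M lines starting at line N), with line numbers referring to the text before the edit. The sketch below is not the grep() from rcsfile.py. It is a naive version that keeps every line of each revision in memory, which is exactly what the real function avoids. The names apply_delta and grep_revisions, and the hand-written deltas for the abc example, are hypothetical.

```python
import re

_COMMAND = re.compile(r"^([ad])(\d+) (\d+)$")


def apply_delta(lines, delta):
    """Apply an RCS edit script (a list of lines) to a list of text lines."""
    result = []
    pos = 0  # index of the next not-yet-copied line of the original text
    i = 0
    while i < len(delta):
        command = _COMMAND.match(delta[i])
        if command is None:
            raise ValueError("not an edit command: %r" % delta[i])
        op = command.group(1)
        start, count = int(command.group(2)), int(command.group(3))
        i += 1
        if op == "d":
            # Copy everything before line `start`, then skip `count` lines.
            result.extend(lines[pos:start - 1])
            pos = start - 1 + count
        else:
            # Copy up to and including line `start`, then insert the
            # `count` lines that follow the command in the script.
            result.extend(lines[pos:start])
            pos = start
            result.extend(delta[i:i + count])
            i += count
    result.extend(lines[pos:])
    return result


def grep_revisions(head_rev, head_lines, older, pattern):
    """Yield (revision, line number, line) for each matching line.

    `older` lists (revision, delta) pairs from newest to oldest.
    Assumption: lines are tested with re.search.
    """
    regex = re.compile(pattern)
    lines = head_lines
    for revision, delta in [(head_rev, [])] + older:
        lines = apply_delta(lines, delta)
        for number, line in enumerate(lines, start=1):
            if regex.search(line):
                yield revision, number, line


head = ["The quick brown", "jumped over the", "lazy dog. Woof!"]
older = [("1.2", ["d3 1"]), ("1.1", ["d2 1"])]
for hit in grep_revisions("1.3", head, older, r".*"):
    print(*hit, sep="\t")
```

Walking from the newest revision back to the oldest is why the output above lists 1.3 first: each older revision is obtained by applying one more edit script to the text of its successor.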

Addendum: I learned after the fact that O’Reilly’s “UNIX Power Tools” offers something similar by the same name, except that it runs several processes, such as co, grep and sed, as opposed to a single Python script.

--
|_)|_/  Ryan Kavanagh		| Debian Developer
| \| \  http://ryanak.ca/	| GPG Key 4A11C97A