The `Bow' Toolkit

Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering

Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow).

The library and its front-ends were designed and written by Andrew McCallum, with contributions from several graduate and undergraduate students.

The name of the library rhymes with `low', not `cow'.

About the library

The library provides facilities for:

Recursively descending directories, finding text files.
Finding `document' boundaries when there are multiple documents per file.
Tokenizing a text file, according to several different methods.
Including N-grams among the tokens.
Mapping strings to integers and back again, very efficiently.
Building a sparse matrix of document/token counts.
Pruning vocabulary by word counts or by information gain.
Building and manipulating word vectors.
Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.
Scoring queries for retrieval or classification.
Writing all data structures to disk in a compact format.
Reading the document/token matrix from disk in an efficient, sparse fashion.
Performing test/train splits, and automatic classification tests.
Operating in server mode, receiving and answering queries over a socket.
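
As an illustration of one facility listed above, mapping strings to integers and back again, here is a minimal sketch of a vocabulary table built on a fixed-size hash table with linear probing. It is not the libbow API: every name in it (vocab, vocab_str2int, vocab_int2str, VOCAB_CAPACITY) is hypothetical, and the fixed capacity is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; this is NOT the libbow API. */
#define VOCAB_CAPACITY 1024     /* table size; must be a power of two */

typedef struct
{
  char *words[VOCAB_CAPACITY];  /* id -> string */
  int slots[VOCAB_CAPACITY];    /* hash slot -> id + 1; 0 means empty */
  int size;                     /* number of distinct strings stored */
} vocab;

/* FNV-1a style hash of a NUL-terminated string. */
static unsigned long
hash_string (const char *s)
{
  unsigned long h = 2166136261UL;
  while (*s)
    {
      h ^= (unsigned char) *s++;
      h *= 16777619UL;
    }
  return h;
}

void
vocab_init (vocab *v)
{
  assert (v != NULL);
  memset (v, 0, sizeof *v);
}

/* Return the id of WORD, adding it if it is new.  Return -1 if the
   table is half full (the limit chosen here) or memory runs out. */
int
vocab_str2int (vocab *v, const char *word)
{
  unsigned long i;
  size_t len;
  char *copy;

  assert (v != NULL && word != NULL);
  i = hash_string (word) & (VOCAB_CAPACITY - 1);
  /* Linear probing.  The table never gets more than half full, so an
     empty slot always exists and this loop terminates. */
  while (v->slots[i] != 0)
    {
      if (strcmp (v->words[v->slots[i] - 1], word) == 0)
        return v->slots[i] - 1;
      i = (i + 1) & (VOCAB_CAPACITY - 1);
    }
  if (v->size >= VOCAB_CAPACITY / 2)
    return -1;
  len = strlen (word) + 1;
  copy = malloc (len);
  if (copy == NULL)
    return -1;
  memcpy (copy, word, len);
  v->words[v->size] = copy;
  v->slots[i] = v->size + 1;
  return v->size++;
}

/* Return the string for ID, or NULL if ID was never assigned. */
const char *
vocab_int2str (const vocab *v, int id)
{
  assert (v != NULL);
  return (id >= 0 && id < v->size) ? v->words[id] : NULL;
}

/* Release all stored strings and reset the table to empty. */
void
vocab_free (vocab *v)
{
  int i;
  assert (v != NULL);
  for (i = 0; i < v->size; i++)
    free (v->words[i]);
  memset (v, 0, sizeof *v);
}
```

Ids are handed out in order of first appearance, so after vocab_init the first new string gets id 0, the second id 1, and looking up a string already seen returns its existing id. A growable table would be needed for a real vocabulary.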

The library does not:

Have English parsing or part-of-speech tagging facilities.
Do smoothing across N-gram models.
Claim to be finished.
Have good documentation.
Claim to be bug-free.

The library is known to compile on most UNIX systems, including Linux, Solaris, SunOS, IRIX and HP-UX. More than a year ago it also compiled on Windows NT (with a GNU build environment); it no longer does, but probably could with small fixes. It is developed on a Linux system. Patches to the code are most welcome.

The code conforms to the GNU coding standards. It is released under the GNU Library General Public License (LGPL).

Citation

You are welcome to use the code under the terms of the license for research or commercial purposes; however, please acknowledge its use with a citation:

   McCallum, Andrew Kachites.  "Bow: A toolkit for statistical language
   modeling, text retrieval, classification and clustering."
   http://www.cs.cmu.edu/~mccallum/bow.  1996.

Here is a BibTeX entry:

   @unpublished{McCallumLibbow,
      author = "Andrew Kachites McCallum",
      title = "Bow: A toolkit for statistical language modeling, 
               text retrieval, classification and clustering",
      note = "http://www.cs.cmu.edu/~mccallum/bow",
      year = 1996}

Obtaining the Source

Source code for the library can be downloaded from this directory. Each version is labeled with an eight-digit sequence giving the year, month and day, so the most recent version is the one with the largest number.
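
Because the eight-digit labels all have the same length and run from year to month to day, plain string comparison orders them by date. The following is a small sketch of picking the newest label from a list; the function name latest_version is hypothetical, and it assumes the labels have already been extracted from the file names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the most recent of N eight-digit YYYYMMDD version labels,
   or NULL if N is zero.  All labels have the same length and their
   fields run from most to least significant, so strcmp order is
   the same as date order.  (Hypothetical helper, not part of Bow.) */
const char *
latest_version (const char *const versions[], size_t n)
{
  const char *best = NULL;
  size_t i;

  for (i = 0; i < n; i++)
    {
      assert (strlen (versions[i]) == 8);
      if (best == NULL || strcmp (versions[i], best) > 0)
        best = versions[i];
    }
  return best;
}
```

For example, given the made-up labels "19971203", "19980520" and "19980101", the function returns "19980520".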

Unfortunately, I do not have time to help rainbow's many users with all their compilation and usage problems. Feel free to send me mail asking for help, but please understand that I may not have time to respond. Bug reports accompanied by fixes are most appreciated.

Bow Library Front-Ends

The library source distribution currently includes three executable programs based on the library.

Rainbow is an executable program that does document classification. While mostly designed for classification by naive Bayes, it also provides TFIDF/Rocchio, Probabilistic Indexing and K-nearest neighbor.
Arrow is an executable program that does document retrieval. It currently performs only simple TFIDF-based retrieval.
Crossbow is an executable program that does document clustering (and also classification).
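
Naive Bayes classification, which Rainbow is mostly designed for, depends on an estimate of the probability of each word given a class. As a rough illustration, here is the standard Laplace (add-one) estimate, one of the smoothing methods the library lists. This is a sketch only: the function name laplace_prob and its parameters are hypothetical, and it is not taken from the Rainbow source.

```c
#include <assert.h>

/* Laplace (add-one) estimate of P(word | class):

       (word_count + 1) / (class_total + vocab_size)

   WORD_COUNT is the number of occurrences of the word in the class's
   training documents, CLASS_TOTAL is the number of word occurrences
   in that class overall, and VOCAB_SIZE is the number of distinct
   words.  (Hypothetical helper, not the Rainbow implementation.) */
double
laplace_prob (long word_count, long class_total, long vocab_size)
{
  assert (word_count >= 0);
  assert (class_total >= word_count);
  assert (vocab_size > 0);
  return ((double) word_count + 1.0)
    / ((double) class_total + (double) vocab_size);
}
```

The added count keeps a word that never occurred in a class from receiving probability zero: with a vocabulary of 4 words and no training data at all, every word gets 1/4.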

Last updated: 12 September 1998, mccallum@cs.cmu.edu