The `Bow' Toolkit
Bow: A Toolkit for Statistical Language Modeling,
Text Retrieval, Classification and Clustering
Bow (or libbow) is a library of C code useful for
writing statistical text analysis, language modeling and information
retrieval programs. The current distribution includes the library, as
well as front-ends for document classification (rainbow),
document retrieval (arrow) and document clustering
(crossbow).
The library and its front-ends were designed and written by Andrew McCallum, with some
contributions from several graduate
and undergraduate students.
The name of the library rhymes with `low', not `cow'.
About the library
The library provides facilities for:
- Recursively descending directories, finding text files.
- Finding `document' boundaries when there are multiple documents per file.
- Tokenizing a text file, according to several different methods.
- Including N-grams among the tokens.
- Mapping strings to integers and back again, very efficiently.
- Building a sparse matrix of document/token counts.
- Pruning vocabulary by word counts or by information gain.
- Building and manipulating word vectors.
- Setting word vector weights according to Naive Bayes, TFIDF, and
several other methods.
- Smoothing word probabilities according to Laplace (Dirichlet
uniform), M-estimates, Witten-Bell, and Good-Turing.
- Scoring queries for retrieval or classification.
- Writing all data structures to disk in a compact format.
- Reading the document/token matrix from disk in an efficient,
sparse fashion.
- Performing test/train splits, and automatic classification tests.
- Operating in server mode, receiving and answering queries over a
socket.
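As a rough illustration of two of the facilities above (mapping strings to integers and back, and building sparse document/token counts), here is a minimal, self-contained sketch in C. It does not use libbow's actual API: the names `vocab`, `entry`, `vocab_id`, `count_tokens` and `MAX_WORDS` are hypothetical and exist only for this example, and the linear scan stands in for the much more efficient mapping the library provides.

```c
#include <assert.h>
#include <string.h>

#define MAX_WORDS 64   /* hypothetical fixed capacity, for this sketch only */

/* Toy vocabulary: id -> string is an array lookup; string -> id is a
   linear scan. An efficient implementation would hash the string. */
struct vocab {
    const char *words[MAX_WORDS];
    int size;
};

/* One nonzero cell of a document's row in the document/token matrix. */
struct entry {
    int id;      /* token id */
    int count;   /* occurrences of that token in this document */
};

/* Return the id of `word`, assigning the next free id if it is new.
   The caller must keep the string alive; it is not copied. */
int vocab_id(struct vocab *v, const char *word)
{
    for (int i = 0; i < v->size; i++)
        if (strcmp(v->words[i], word) == 0)
            return i;
    assert(v->size < MAX_WORDS);
    v->words[v->size] = word;
    return v->size++;
}

/* Count `ntokens` tokens into a sparse row and return the number of
   entries used. `row` must have room for `ntokens` entries. */
int count_tokens(struct vocab *v, const char *const tokens[], int ntokens,
                 struct entry row[])
{
    int used = 0;
    for (int t = 0; t < ntokens; t++) {
        int id = vocab_id(v, tokens[t]);
        int k = 0;
        while (k < used && row[k].id != id)
            k++;
        if (k == used) {          /* first time this id appears in the row */
            row[used].id = id;
            row[used].count = 0;
            used++;
        }
        row[k].count++;
    }
    return used;
}
```

Only tokens that actually occur in a document get an entry, which is what makes the row sparse. Going back from an integer to its string is just `v.words[id]`.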
The library does not:
- Have English parsing or part-of-speech tagging facilities.
- Do smoothing across N-gram models.
- Claim to be finished.
- Have good documentation.
- Claim to be bug-free.
It is known to compile on most UNIX systems, including Linux,
Solaris, SunOS, IRIX and HP-UX. More than a year ago it also compiled on
Windows NT (with a GNU build environment); it no longer does,
but probably could with small fixes. Patches to the code are most
welcome. It is developed on a Linux system.
The code conforms to the GNU coding
standards. It is released under the GNU Library General Public
License (LGPL).
Citation
You are welcome to use the code under the terms of the license for
research or commercial purposes; however, please acknowledge its use
with a citation:
McCallum, Andrew Kachites. "Bow: A toolkit for statistical language
modeling, text retrieval, classification and clustering."
http://www.cs.cmu.edu/~mccallum/bow. 1996.
Here is a BibTeX entry:
@unpublished{McCallumLibbow,
author = "Andrew Kachites McCallum",
title = "Bow: A toolkit for statistical language modeling,
text retrieval, classification and clustering",
note = "http://www.cs.cmu.edu/~mccallum/bow",
year = 1996}
Obtaining the Source
Source code for the library can be downloaded from this directory. Versions are identified by
eight-digit sequences giving the year, month and day, so the most
recent version is the one with the largest version number.
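Because a year-month-day stamp sorts numerically in chronological order, picking the newest release reduces to picking the largest number. The sketch below shows one way to do that in C. The function names `version_stamp` and `newest` are hypothetical, and the sketch assumes only that each file name contains its eight-digit stamp somewhere; the actual file names in the download directory are not specified here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the first run of exactly eight digits in `name` as a number
   (YYYYMMDD), or -1 if the name contains no such run. */
long version_stamp(const char *name)
{
    assert(name != NULL);
    for (const char *p = name; *p != '\0'; ) {
        if (!isdigit((unsigned char)*p)) {
            p++;
            continue;
        }
        long value = 0;
        int len = 0;
        while (isdigit((unsigned char)*p)) {
            if (len < 8)                 /* ignore overflow digits; run is rejected below */
                value = value * 10 + (*p - '0');
            len++;
            p++;
        }
        if (len == 8)
            return value;
    }
    return -1;
}

/* Return the index of the most recent name among `n` candidates,
   or -1 if none of them carries an eight-digit stamp. */
int newest(const char *const names[], int n)
{
    int best = -1;
    long best_stamp = -1;
    for (int i = 0; i < n; i++) {
        long s = version_stamp(names[i]);
        if (s > best_stamp) {
            best_stamp = s;
            best = i;
        }
    }
    return best;
}
```

Digit runs that are shorter or longer than eight characters are skipped, so an unrelated number in a file name is not mistaken for a version.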
Unfortunately, I do not have time to help rainbow's many users
with all their compilation and usage problems. Feel free to send me
mail asking for help, but please do not expect that I will always have
time to respond. Bug reports accompanied by fixes are most
appreciated.
Bow Library Front-Ends
The library source distribution currently includes three
executable programs based on the library.
- Rainbow is an executable program that does
document classification. While mostly designed for classification by
naive Bayes, it also provides TFIDF/Rocchio, Probabilistic Indexing
and K-nearest neighbor.
- Arrow is an executable program that does
document retrieval. It currently only performs simple TFIDF-based
retrieval.
- Crossbow is an executable program that
does document clustering (and also classification).
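To give a feel for the kind of naive Bayes classification rainbow is mostly designed for, here is a small, self-contained sketch in C. It is not rainbow's code and does not use the library's API: the names `word_prob` and `classify`, and the fixed sizes `VOCAB` and `CLASSES`, are assumptions made for this example. It uses Laplace (add-one) smoothing and multiplies probabilities directly, which is only reasonable for very short documents.

```c
#include <assert.h>

#define VOCAB 3      /* hypothetical vocabulary size */
#define CLASSES 2    /* hypothetical number of classes */

/* Laplace (add-one) estimate of P(word | class) from that class's
   training counts. */
double word_prob(const int counts[VOCAB], int word)
{
    int total = 0;
    for (int w = 0; w < VOCAB; w++)
        total += counts[w];
    return (counts[word] + 1.0) / (total + (double)VOCAB);
}

/* Return the class that maximizes
       prior[c] * product over words of P(word | c)^doc[word].
   Plain products keep the sketch short; practical code would sum
   logarithms instead to avoid underflow on long documents. */
int classify(const int train[CLASSES][VOCAB], const double prior[CLASSES],
             const int doc[VOCAB])
{
    int best = -1;
    double best_score = -1.0;
    for (int c = 0; c < CLASSES; c++) {
        double score = prior[c];
        for (int w = 0; w < VOCAB; w++)
            for (int k = 0; k < doc[w]; k++)
                score *= word_prob(train[c], w);
        if (score > best_score) {
            best_score = score;
            best = c;
        }
    }
    assert(best >= 0);
    return best;
}
```

Here `train[c][w]` holds how often word `w` occurred in the training documents of class `c`, and `doc[w]` holds how often it occurs in the document being classified. Smoothing keeps a word that was never seen in a class from forcing that class's score to zero.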
Last updated: 12 September 1998,
mccallum@cs.cmu.edu