This documentation is intended as a brief tutorial for using rainbow, version 0.9 or later. It is not complete documentation. It is not a tutorial on the source code.
The examples on this page assume that you have compiled libbow and rainbow, and that rainbow is in your path. Several of the examples also assume that you have downloaded the 20_newsgroups data set, unpacked it in your home directory, and therefore that its files are available in the directory ~/20_newsgroups.
You can obtain on-line documentation of each rainbow command-line option by typing
rainbow --help | more
This --help option is useful for checking the latest details of particular options, but it does not provide a tutorial or an overview of rainbow's use.
Command-line options in rainbow and all the Bow library frontends are handled by the libargp library from the FSF. Many command-line options have both long and short forms. For example, to set the verbosity level to 4 (to make rainbow give more runtime diagnostic messages than usual), you can type "--verbosity=4", or "--verbosity 4", or "-v 4". For more detail about the verbosity option, see section 5.1.
Before performing classification or diagnostics with rainbow, you must first have rainbow index your data--that is, read your documents and archive a "model" containing their statistics. The text indexed for the model must contain all the training data. The testing data may also be read as part of the model, or it can be left out and read later.
The model is placed in the file system location indicated by the -d option. If no -d option is given, the name ~/.rainbow is used by default. (The model name is actually a file system directory containing separate files for different aspects of the model. If the model directory location does not exist when rainbow is invoked, rainbow will create it automatically.)
In the most basic setting, the text data should be in plain text files, one file per document. No special tags are needed at the beginning or end of documents. Thus, for example, you should be able to index a directory of UseNet articles or MH mailboxes without any preprocessing. The files should be organized in directories, such that all documents with the same class label are contained within a directory. (Rainbow does not directly support classification tasks in which individual documents have multiple class labels. I recommend handling this as a series of binary classification tasks.)
To build a model, call rainbow with the --index (or -i) option, followed by one directory name for each class. For example, to build a model that distinguishes among the three talk.politics classes of 20_newsgroups (and store that model in the directory ~/model), invoke rainbow like this:
rainbow -d ~/model --index ~/20_newsgroups/talk.politics.*
where ~/20_newsgroups/talk.politics.* would be expanded by the shell like this:
~/20_newsgroups/talk.politics.guns ~/20_newsgroups/talk.politics.mideast ~/20_newsgroups/talk.politics.misc
To build a model containing all 20 newsgroups, type:
rainbow -d ~/model --index ~/20_newsgroups/*
When indexing a file, rainbow turns the file's stream of characters into tokens by a process called tokenization or "lexing".
By default, rainbow tokenizes all alphabetic sequences of characters (that is, characters in A-Z and a-z), changing each sequence to lowercase and tossing out any token that is on the "stoplist", a list of common words such as "the", "of", "is", etc.
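As a rough illustration of the default behavior described above, the following Python sketch extracts alphabetic runs, lowercases them, and drops stoplisted words. It is not rainbow's lexer: the function name tokenize and the seven-word STOPLIST are stand-ins made up for this example.

```python
import re

# Tiny illustrative stoplist (an assumption); rainbow's real stoplist is much longer.
STOPLIST = {"the", "of", "is", "a", "and", "to", "in"}


def tokenize(text, stoplist=STOPLIST):
    """Lowercase every run of A-Z/a-z characters and drop stoplisted words."""
    words = (match.lower() for match in re.findall(r"[A-Za-z]+", text))
    return [word for word in words if word not in stoplist]


if __name__ == "__main__":
    # Digits and punctuation never become tokens; they only separate tokens.
    print(tokenize("The price of RAM is $40 in 1998."))
```

Note that under this rule an apostrophe splits a word, so "don't" yields the two tokens "don" and "t".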
Rainbow supports several options for tokenizing text. For example, the --skip-headers (or -h) option causes rainbow to skip newsgroup or email headers before beginning tokenization. It does this by scanning forward until it finds two newlines in a row. (This option should be used for the 20_newsgroups dataset, since the headers include the name of the correct newsgroup!)
rainbow -d ~/model -h --index ~/20_newsgroups/talk.politics.*
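The header-skipping rule can be sketched in a few lines of Python. The function name skip_headers is hypothetical, and returning an empty string when no blank line is found is an assumption about what scanning to the end of the file would yield, not documented rainbow behavior.

```python
def skip_headers(text):
    """Return the part of text that follows the first two newlines in a row."""
    # partition() gives an empty body when the separator never occurs
    # (assumption: with no blank line, nothing is left to tokenize).
    _headers, _separator, body = text.partition("\n\n")
    return body


if __name__ == "__main__":
    article = "Newsgroups: talk.politics.misc\nSubject: hello\n\nBody text here.\n"
    print(skip_headers(article))
```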
Some other examples of handy tokenizing options are:
--use-stemming | Pass all words through the Porter stemmer before counting them. (The default is not to stem.)
--no-stoplist | Include words in the stoplist among the statistics. The default is to skip them. (The stoplist is the SMART system's list of 524 common words, like "the" and "of".)
--istext-avoid-uuencode | Attempt to detect when a file mostly consists of a uuencoded block, and if so, skip it. This option is useful for tokenizing UseNet articles, because word statistics can be thrown off by repetitive tokens found in uuencoded images.
--skip-html | Skip all characters between "<" and ">". Useful for lexing HTML files.
--lex-pipe-command SHELLCMD | Rather than tokenizing the file directly, pass the file as standard input into this shell command, and tokenize the standard output of the shell command. For example, to index only the first 20 lines of each file, use:
rainbow --lex-pipe-command "head -n 20" -d ~/model --index ~/20_newsgroups/talk.politics.*
--lex-white | Rather than tokenizing the file with the default rules (skipping non-alphabetics, downcasing, etc.), simply grab space-delimited strings and make no further changes. This option is useful if you want to take complete control of tokenization with your own script, as specified by --lex-pipe-command, and don't want rainbow to make any further changes.
For a complete list of rainbow tokenizing options, see the "Lexing options" section in the output of rainbow --help.
Once indexing is performed and a model has been archived to disk, rainbow can perform document classification. Statistics from a set of training documents will determine the parameters of the classifier; classification of a set of testing documents will be output.
The --test (or -t) option performs a specified number of trials and prints the classifications of the documents in each trial's test-set to standard output. For example,
rainbow -d ~/model --test-set=0.4 --test=3
will output the results of three trials, each with a randomized test-train split in which 60 percent of the documents are used for training, and 40 percent for testing. Details of the --test-set option are described in section 3.1.
Classification results are printed as a series of text lines that look something like this:
/home/mccallum/20_newsgroups/talk.politics.misc/178939 talk.politics.misc talk.politics.misc:0.98 talk.politics.mideast:0.015 talk.politics.guns:0.005
That is, one test file per line, consisting of the following fields:
directory/filename TrueClass TopPredictedClass:score1 2ndPredictedClass:score2 ...
The Perl script rainbow-stats, which is provided in the Bow source distribution, reads lines like this and outputs average accuracy, standard error, and a confusion matrix.
For example, the command
rainbow -d ~/model --test-set=0.4 --test=2 | rainbow-stats
will, for a model built from the three talk.politics classes, print something like the following:
Trial 0
Correct: 1079 out of 1201 (89.84 percent accuracy)
 - Confusion details, row is actual, column is predicted
               classname    0    1    2  :total
 0    talk.politics.guns  372    2   27  :401  92.77%
 1 talk.politics.mideast    6  371   23  :400  92.75%
 2    talk.politics.misc   44   20  336  :400  84.00%
Trial 1
Correct: 1086 out of 1201 (90.42 percent accuracy)
 - Confusion details, row is actual, column is predicted
               classname    0    1    2  :total
 0    talk.politics.guns  377    2   22  :401  94.01%
 1 talk.politics.mideast    6  371   23  :400  92.75%
 2    talk.politics.misc   40   22  338  :400  84.50%
Percent_Accuracy  average 90.13  stderr 0.21
(To give you some idea of the speed of rainbow: On a 200 MHz Pentium, the above rainbow command finishes in 14 seconds. The command reads the model from disk, and performs two trials--each building a model from about 1800 documents and testing on about 1200. The rainbow-stats command finishes in 2 seconds.)
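If you would rather process the result lines yourself, the field layout shown above is easy to parse. The following Python sketch is not a port of rainbow-stats; the function names are made up here, and it assumes whitespace-separated fields with the top prediction listed first, as in the sample line.

```python
from collections import Counter


def parse_result_line(line):
    """Split one result line into (filename, true_class, [(class, score), ...])."""
    fields = line.split()
    filename, true_class = fields[0], fields[1]
    predictions = []
    for field in fields[2:]:
        # Split on the last ':' so the score is always the final component.
        label, _, score = field.rpartition(":")
        predictions.append((label, float(score)))
    return filename, true_class, predictions


def accuracy_and_confusion(lines):
    """Return (accuracy, Counter keyed by (actual_class, predicted_class))."""
    confusion = Counter()
    for line in lines:
        if len(line.split()) < 3:
            continue  # anything that does not look like a result line is ignored
        _, true_class, predictions = parse_result_line(line)
        predicted = predictions[0][0]  # first entry is the top prediction
        confusion[(true_class, predicted)] += 1
    total = sum(confusion.values())
    correct = sum(n for (actual, guess), n in confusion.items() if actual == guess)
    return (correct / total if total else 0.0), confusion
```

Because the counts are keyed by (actual, predicted) pairs, a confusion matrix like the one printed by rainbow-stats can be laid out from the returned Counter.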
The Perl script rainbow-be, also provided in the Bow source distribution, reads lines like this and outputs precision-recall breakeven points.
You can vary the precision with which classification scores are printed using the --score-precision=NUM option, where NUM is the number of digits to print after the decimal point. Note, however, that several internal variables are of type float (which has only about 7 significant digits of resolution), and the classification scores are calculated as doubles (which have only about 15 to 17 significant digits of resolution), so precision is inherently limited. The default printed score precision is 10. This option works only with the naive Bayes classifier.
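To get a feel for the float limitation, this Python sketch round-trips a value through a 4-byte C float using the standard struct module. The helper name as_float32 and the sample score are illustrative only and are not taken from rainbow.

```python
import struct


def as_float32(value):
    """Round-trip a Python float (a C double) through a 4-byte C float."""
    return struct.unpack("f", struct.pack("f", value))[0]


if __name__ == "__main__":
    score = 0.123456789012345  # arbitrary sample value
    print(f"double: {score:.15f}")
    print(f"float:  {as_float32(score):.15f}")
```

The two printed values agree only in their leading digits, which is why asking for many digits after the decimal point does not add real information.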
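When the argument to --test-set is a number containing a decimal point, it is interpreted as the fraction of documents to place in the test set. For example,
rainbow -d ~/model --test-set=0.5 --test=1
will use a pseudo-random number generator to select one-half of the documents in the model and place them into the test set, then place the remaining documents in the training set.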
When the argument to --test-set contains no decimal point, the number is interpreted as an exact number of documents. For example,
rainbow -d ~/model --test-set=30 --test=1
will place 30 documents in the test set, attempting to select a number of documents from each class such that the class proportions in the test set roughly match those in the entire model.
If the number argument is followed by "pc", then the argument indicates a number of documents per class. Thus
rainbow -d ~/model --test-set=200pc --test=1
will place into the test set 200 randomly-selected documents from each of the classes in the model, for a total of 600 test documents, if the model was built using three classes.
You can also specify exactly which files should be in the test set, listing them by name. If the argument to --test-set contains non-numeric characters, it is interpreted as a filename, which in turn should contain a list of white-space-separated filenames of documents indexed in the model. For example,
rainbow -d ~/model --test-set=~/filelist1 --test=1
will open the file ~/filelist1 and take from there the list of names of files to be placed in the test set. Note that the class labels of these documents are already known from when the model file was built.
The filenames in the list should be written exactly as they were when the model was built. A list of all the filenames of documents contained in a rainbow model can be obtained with the following command:
rainbow -d ~/model --print-doc-names
See section 4.3 for more details on the --print-doc-names option.
The default value for --test-set is 0, indicating that no documents are placed in the test set. Thus, when using the --test option, you must use the --test-set option in order to give rainbow some documents to classify.
The training set can be specified using the --train-set option with the same types of arguments described above. For example,
rainbow -d ~/model --test-set=~/filelist1 --train-set=~/filelist2 --test=1
will take all test documents from the list in ~/filelist1, all training documents from ~/filelist2, and ignore all documents that don't appear in either list. It is an error for a document to be listed in both the test set and the train set.
The default value for the --train-set is the keyword remaining, which specifies that all documents not placed in the test set should be placed in the training set.
The keyword remaining can also be used for the test set. For example,
rainbow -d ~/model --train-set=1pc --test-set=remaining --test=1
will put one document from each class into the training set, and put all the rest of the documents in the testing set.
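The argument forms accepted by --test-set and --train-set can be summarized in code. The Python sketch below only mirrors the rules as this tutorial states them; it is not rainbow's own argument parser, and the function name and the returned labels are invented for illustration.

```python
import re


def describe_set_argument(arg):
    """Classify a --test-set / --train-set argument by the forms described above."""
    if arg == "remaining":
        return ("remaining", None)          # every document not already assigned
    if re.fullmatch(r"\d+pc", arg):
        return ("per-class count", int(arg[:-2]))
    if re.fullmatch(r"\d+", arg):
        return ("document count", int(arg))  # no decimal point: exact number
    if re.fullmatch(r"\d*\.\d+|\d+\.\d*", arg):
        return ("fraction", float(arg))      # decimal point: fraction of documents
    return ("file list", arg)                # otherwise: name of a file of filenames


if __name__ == "__main__":
    for example in ["0.4", "30", "200pc", "remaining", "~/filelist1"]:
        print(example, "->", describe_set_argument(example))
```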
You can classify files that were not indexed into the model by replacing the --test option with the --test-files option. For example,
rainbow -d ~/model --test-files ~/more-talk.politics/*
will use all the files in the model as the training set, and output classifications for all files contained in the subdirectories of ~/more-talk.politics/. Note that the number and basenames of the directories listed must match those given to --index when the model was built.
You can classify a single file (read from standard input or from a specified filename) using the --query option.
Rainbow can also efficiently classify individual documents not in the model by running as a server. In this mode, rainbow starts, reads the model from disk, then waits for query documents by listening on a network socket.
To do this, run rainbow with the command line option --query-server=PORT (where PORT is some port number larger than 1000). For example
rainbow -d ~/model --query-server=1821
In order to test the server, telnet to whatever port you specified (e.g. "telnet localhost 1821"), type in a document you want to classify, then type '.' alone on a line, followed by Return. Rainbow will then print back to the socket (and thus to your screen) a list of classes and their scores. If you write your own program to connect to a rainbow server (to replace telnet in this example), make sure to use the sequence "\r\n" to send a newline. Thus, to indicate the end of a query document, you should send the sequence "\r\n.\r\n".
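A minimal client along these lines might look like the following Python sketch. It relies only on the framing rules given above ("\r\n" line endings and a lone "." to end the document); the function names, the UTF-8 encoding, and the read-until-timeout strategy are assumptions, since this tutorial does not say whether the server closes the connection after replying.

```python
import socket


def frame_query(document):
    """Encode a document with \\r\\n line endings, terminated by a lone '.' line."""
    body = "\r\n".join(document.splitlines())
    return (body + "\r\n.\r\n").encode("utf-8")


def query_server(document, host="localhost", port=1821, timeout=5.0):
    """Send one document to a running rainbow query server and return its reply."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(frame_query(document))
        try:
            while True:
                data = sock.recv(4096)
                if not data:      # server closed the connection
                    break
                chunks.append(data)
        except socket.timeout:    # assumption: silence means the reply is complete
            pass
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With a server started as in the example above, calling query_server("some text to classify") should return the same list of classes and scores that telnet displays. A document that itself contains a line consisting only of "." would end the query early; this sketch does not guard against that.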
Feature set or "vocabulary" size may be reduced by occurrence counts or by average mutual information with the class variable (which we also call "information gain"; see [Cover & Thomas, "Elements of Information Theory", Wiley & Sons, 1991]).
--prune-vocab-by-infogain=N or -T | Remove all but the top N words by selecting words with highest average mutual information with the class variable. Default is N=0, which is a special case that removes no words.
--prune-vocab-by-doc-count=N or -D | Remove words that occur in N or fewer documents.
--prune-vocab-by-occur-count=N or -O | Remove words that occur fewer than N times.
For example, to classify using only the 50 words that have the highest mutual information with the class variable, type:
rainbow -d ~/model --prune-vocab-by-infogain=50 --test=1
If you want to see what these 50 words are, type:
rainbow -d ~/model -I 50
There is more information about -I and other diagnostic-printing command-line options in section 4.
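To make "mutual information with the class variable" concrete, here is a small Python sketch that scores each word by the mutual information between the class label and the word's presence in a document, then keeps the top N. This is one common definition and only an illustration: rainbow's own calculation may differ (for example, it may count word occurrences rather than document presence), and all names and the toy data here are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy, in bits, of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def information_gain(docs, word):
    """H(class) minus H(class | word present or absent), over documents.

    docs is a list of (class_label, set_of_words) pairs.
    """
    class_counts = Counter(label for label, _ in docs)
    with_word = Counter(label for label, words in docs if word in words)
    without_word = class_counts - with_word
    gain = entropy(class_counts.values())
    for subset in (with_word, without_word):
        size = sum(subset.values())
        if size:
            gain -= (size / len(docs)) * entropy(subset.values())
    return gain


def top_words_by_infogain(docs, n):
    """Return the n words with the highest information gain (ties broken by name)."""
    vocabulary = set().union(*(words for _, words in docs))
    return sorted(vocabulary, key=lambda w: (-information_gain(docs, w), w))[:n]


if __name__ == "__main__":
    toy = [("guns", {"gun", "law"}), ("guns", {"gun"}),
           ("mideast", {"israel", "law"}), ("mideast", {"israel"})]
    print(top_words_by_infogain(toy, 2))
```

In the toy data, "law" appears equally often in both classes and so carries no information about the class, while "gun" and "israel" each identify a class perfectly.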
The classification method is selected with the --method option. For example,
rainbow -d ~/model --method=tfidf --test=1
will use TFIDF/Rocchio for classification.
--smoothing-method=METHOD | Set the method for smoothing word probabilities to avoid zeros; METHOD may be one of: goodturing, laplace, mestimate, wittenbell. The default is laplace, which is a uniform Dirichlet prior with alpha=2.
--event-model=EVENTNAME | Set what objects will be considered the "events" of the probabilistic model. EVENTNAME can be one of: word (i.e. multinomial, unigram), document (i.e. multi-variate Bernoulli, bit vector), or document-then-word (i.e. document-length-normalized multinomial). For more details on these methods, see "A Comparison of Event Models for Naive Bayes Text Classification". The default is word.
--uniform-class-priors | When classifying and calculating mutual information, use equal prior probabilities on classes, instead of using the distribution determined from the training data.
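As a worked illustration of why smoothing avoids zeros, the Python sketch below applies textbook add-one (Laplace) smoothing to the word counts of one class; add-one is the posterior mode under a uniform Dirichlet prior with alpha=2. It is a generic example with hypothetical names and counts, not a transcription of rainbow's implementation.

```python
def laplace_word_probabilities(word_counts, vocabulary):
    """Add-one estimate of P(word | class) over a fixed vocabulary."""
    total = sum(word_counts.get(word, 0) for word in vocabulary)
    denominator = total + len(vocabulary)  # one extra pseudo-count per word
    return {word: (word_counts.get(word, 0) + 1) / denominator
            for word in vocabulary}


if __name__ == "__main__":
    counts = {"gun": 3, "law": 1}          # "team" never occurs in this class
    vocabulary = ["gun", "law", "team"]
    for word, p in laplace_word_probabilities(counts, vocabulary).items():
        print(f"{word:5s} {p:.4f}")
```

The unseen word still receives a small nonzero probability, and the probabilities over the vocabulary sum to one.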
In addition to using a model for document classification, you can also print various information about the model.
To see a list of the words that have highest average mutual information with the class variable (sorted by mutual information), use the --print-word-infogain (or -I) option. For example
rainbow -d ~/model -I 10
When invoked on a model containing all 20 classes of the 20_newsgroups dataset, the following is printed to standard output:
0.09381 windows
0.09003 god
0.07900 dod
0.07700 government
0.06609 team
0.06570 game
0.06448 people
0.06323 car
0.06171 bike
0.05609 hockey
The above is calculated using all the training data. To restrict the calculation to a subset of the data, use any of the methods for defining the training set described in section 3.1. For example, to calculate mutual information based just on the documents listed in ~/docs1, type:
rainbow -d ~/model --train-set=~/docs1 -I 10
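To print the word probabilities for a single class, use the --print-word-probabilities option. For example, with the vocabulary first reduced to 10 words:
rainbow -d ~/model -T 10 --print-word-probabilities=talk.politics.mideast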
Here is the output of this command. Notice that the word probabilities correctly sum to one.
god 0.05026782
people 0.64977338
government 0.24062629
car 0.03502266
game 0.00412031
team 0.01030078
bike 0.00041203
dod 0.00041203
hockey 0.00123609
windows 0.00782859
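The sum can be checked mechanically. This Python sketch parses the word/probability pairs printed above and adds them up; any small shortfall from 1.0 is rounding in the printed digits. The function name is made up for this example.

```python
OUTPUT = """
god 0.05026782 people 0.64977338 government 0.24062629 car 0.03502266
game 0.00412031 team 0.01030078 bike 0.00041203 dod 0.00041203
hockey 0.00123609 windows 0.00782859
"""


def sum_probabilities(text):
    """Add up the probabilities in whitespace-separated 'word probability' pairs."""
    fields = text.split()
    # Words sit at even positions, probabilities at odd positions.
    return sum(float(value) for value in fields[1::2])


if __name__ == "__main__":
    print(f"{sum_probabilities(OUTPUT):.8f}")
```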
To print the number of times a word occurs in each class (as well as the total number of words in the class, and the word's probability in each class), use the --print-word-counts option. For example, the following command prints diagnostics about the word team.
rainbow -d ~/model --print-word-counts=team
Here is the output of the above command, on a model built from 20_newsgroups. Note that the word probabilities (in parentheses) may not simply be equal to the ratio of the two preceding counts, because of smoothing.
    2 / 125039  ( 0.00002) alt.atheism
    6 / 119511  ( 0.00005) comp.graphics
    5 /  91147  ( 0.00005) comp.os.ms-windows.misc
    1 /  71002  ( 0.00001) comp.sys.mac.hardware
   12 / 131120  ( 0.00009) comp.windows.x
   15 /  62130  ( 0.00024) misc.forsale
    2 /  83942  ( 0.00002) rec.autos
   10 /  78685  ( 0.00013) rec.motorcycles
  543 /  88623  ( 0.00613) rec.sport.baseball
  970 / 115109  ( 0.00843) rec.sport.hockey
    9 / 136655  ( 0.00007) sci.crypt
    1 /  81206  ( 0.00001) sci.electronics
    8 / 125235  ( 0.00006) sci.med
   71 / 128754  ( 0.00055) sci.space
    2 / 141389  ( 0.00001) soc.religion.christian
   13 / 135054  ( 0.00010) talk.politics.guns
   24 / 208367  ( 0.00012) talk.politics.mideast
   14 / 164266  ( 0.00009) talk.politics.misc
    9 / 130013  ( 0.00007) talk.religion.misc
(Note: the probability of the word team is not equal to the probability of team from the --print-word-probabilities command above, because we did not reduce vocabulary size to 10 in this example.)
To print a list of the filenames of all documents, use the --print-doc-names option. Document filenames are printed in the order in which they were indexed. Thus all documents of the same class appear contiguously.
This command is often useful for generating lists of document names to be used with the --test-set and --train-set options.
For example, the following command prints 10 randomly selected documents that were indexed. In order to obtain a random selection, gawk, the GNU version of awk, is used to generate random numbers (seeded with srand() so that each run gives a different selection), and sort is used to permute the list. The command head is then used to select the first 10 from the permuted list.
rainbow -d ~/model --print-doc-names \
  | gawk 'BEGIN {srand()} {print rand(), $1}' \
  | sort -n \
  | gawk '{print $2}' \
  | head -n 10
Example output of this command on the 20_newsgroups data set is:
~/20_newsgroups/rec.motorcycles/104735
~/20_newsgroups/comp.windows.x/67345
~/20_newsgroups/sci.med/59555
~/20_newsgroups/talk.politics.misc/178418
~/20_newsgroups/misc.forsale/76867
~/20_newsgroups/rec.sport.hockey/52601
~/20_newsgroups/talk.politics.mideast/77394
~/20_newsgroups/comp.os.ms-windows.misc/9661
~/20_newsgroups/talk.politics.mideast/75947
~/20_newsgroups/talk.politics.misc/179105
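The same selection can be made in Python, which may be more convenient when the result is to be written to a file for --test-set or --train-set. This is a sketch with a hypothetical function name; it expects the lines printed by --print-doc-names and, like the gawk pipeline, uses only the first whitespace-separated field of each line.

```python
import random


def sample_doc_names(lines, k):
    """Pick up to k distinct document names uniformly at random."""
    names = [line.split()[0] for line in lines if line.strip()]
    return random.sample(names, min(k, len(names)))


if __name__ == "__main__":
    listing = [
        "~/20_newsgroups/rec.motorcycles/104735",
        "~/20_newsgroups/comp.windows.x/67345",
        "~/20_newsgroups/sci.med/59555",
        "~/20_newsgroups/misc.forsale/76867",
    ]
    # Whitespace-separated names are the format a --test-set file list expects.
    print(" ".join(sample_doc_names(listing, 2)))
```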
You can also print the names of just those documents that fall into one of the sets of the test/train split. For example
rainbow -d ~/model --train-set=3pc --print-doc-names=train
will select three documents from each class to be in the training set, and print just those documents. The output of this command might be:
~/20_newsgroups/talk.politics.guns/53329
~/20_newsgroups/talk.politics.guns/54704
~/20_newsgroups/talk.politics.guns/54656
~/20_newsgroups/talk.politics.mideast/76420
~/20_newsgroups/talk.politics.mideast/76523
~/20_newsgroups/talk.politics.mideast/77392
~/20_newsgroups/talk.politics.misc/179005
~/20_newsgroups/talk.politics.misc/176939
~/20_newsgroups/talk.politics.misc/179083
You can print the entire word/document matrix to standard output using the --print-matrix option. Documents are printed one to a line. The first (white-space separated) field is the document name; this is followed by entries for the words.
There are several different alternatives for the format in which the words are printed, and all of them are amenable to processing by perl or awk, and somewhat human-readable. The alternatives are specified by an optional "formatting" argument to the --print-matrix option.
The format is specified as a string of three characters, consisting of selections from the following three groups
Print entries for all words in the vocabulary, or just print the words that actually occur in the document:
a | all
s | sparse (default)
Print word counts as integers or as binary presence/absence indicators:
b | binary
i | integer (default)
How to indicate the word itself:
n | integer word index
w | word string
c | combination of integer word index and word string (default)
e | empty; don't print anything to indicate the identity of the word
For example, to print a sparse matrix, in which the word string and the word counts for each document are listed, use the format string "siw". The command
rainbow -d ~/model -T 100 --print-matrix=siw | head -n 10
reduces the vocabulary to only 100 words, then prints
~/20_newsgroups/alt.atheism/53366 alt.atheism god 2 jesus 1 nasa 2 people 2
~/20_newsgroups/alt.atheism/53367 alt.atheism jesus 2 jewish 1 christian 1
~/20_newsgroups/alt.atheism/51247 alt.atheism god 4 evidence 2
~/20_newsgroups/alt.atheism/51248 alt.atheism
~/20_newsgroups/alt.atheism/51249 alt.atheism nasa 1 country 2 files 1 law 3 system 1 government 1
~/20_newsgroups/alt.atheism/51250 alt.atheism god 3 people 2 evidence 1 law 1 system 1 public 5 rights 1 fact 1 religious 1
~/20_newsgroups/alt.atheism/51251 alt.atheism
~/20_newsgroups/alt.atheism/51252 alt.atheism people 4 evidence 2 system 2 religion 1
~/20_newsgroups/alt.atheism/51253 alt.atheism god 19 christian 1 evidence 1 faith 5 car 2 space 1 game 1
~/20_newsgroups/alt.atheism/51254 alt.atheism people 1 jewish 3 game 1 bible 7
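Output in this sparse word-string format is straightforward to load into other tools. The Python sketch below reads one such line into a dictionary of counts. It assumes the layout seen in the sample output (document name, class name, then alternating word and count fields); the function name is hypothetical.

```python
def parse_siw_line(line):
    """Parse one sparse word/count line into (doc_name, class_name, {word: count})."""
    fields = line.split()
    name, label, rest = fields[0], fields[1], fields[2:]
    # Words are at even offsets and their counts at the following odd offsets.
    counts = {word: int(count) for word, count in zip(rest[0::2], rest[1::2])}
    return name, label, counts


if __name__ == "__main__":
    sample = "~/20_newsgroups/alt.atheism/53366 alt.atheism god 2 jesus 1 nasa 2 people 2"
    print(parse_siw_line(sample))
```

A document containing none of the retained vocabulary words simply yields an empty dictionary.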
To print a non-sparse matrix, indicating the binary presence/absence of all words in the vocabulary for each document, use the format string "abe". The command
rainbow -d ~/model -T 10 --print-matrix=abe | head -n 10
reduces the vocabulary to only 10 words, then prints
~/20_newsgroups/alt.atheism/53366 alt.atheism 1 1 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/53367 alt.atheism 0 0 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51247 alt.atheism 1 0 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51248 alt.atheism 0 0 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51249 alt.atheism 0 0 1 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51250 alt.atheism 1 1 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51251 alt.atheism 0 0 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51252 alt.atheism 0 1 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51253 alt.atheism 1 0 0 1 1 0 0 0 0 0
~/20_newsgroups/alt.atheism/51254 alt.atheism 0 1 0 0 1 0 0 0 0 0
For a summary of all the diagnostic options, see the "Diagnostics" section of the rainbow --help output.
Rainbow prints messages about its progress to standard error as it runs. You can change the verbosity of these progress messages with the --verbosity=LEVEL (or -v) option. The argument LEVEL should be an integer from 0 to 5, 0 being silent (no progress messages printed to standard error), and 5 being most verbose. The default is 2.
For example, the following command will print no progress messages.
rainbow -v 0 -d ~/model -I 10
Some of the progress messages print backspace characters in order to show running counters. When running rainbow with GDB inside an Emacs buffer, however, the backspace character is printed as a character escape sequence and fills the buffer. You can avoid printing progress messages that contain backspace characters by using the --no-backspaces (or -b) option.
Rainbow may use a pseudo-random number generator for several tasks, including the randomized test-train splits described in section 3.1. You can specify the seed for this random number generator using the --random-seed option. For example
rainbow -d ~/model -t 1 --test-set=0.3 --random-seed=2
You can verify that use of the same random seed results in identical test/train splits by using the --print-doc-names option. For example
rainbow -d ~/model --random-seed=1 --train-set=4pc --print-doc-names=train
will perform the specified test/train split, then print only the training documents. The above command will produce the same output each time it is called. However, the above command with the --random-seed=1 option removed will print different document names each time.
If this option is not given, then the seed is set using the computer's real-time clock.
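The effect of a fixed seed can be imitated outside rainbow. The following Python sketch draws a fixed number of documents per class using a seeded generator, so repeated calls with the same seed return the same selection. It illustrates the idea only: the function name is invented, the class is taken to be the name of each file's parent directory, and the documents chosen will not match rainbow's own choices for the same seed.

```python
import os
import random
from collections import defaultdict


def per_class_sample(doc_names, per_class, seed):
    """Choose per_class documents from each class, reproducibly for a given seed."""
    by_class = defaultdict(list)
    for name in doc_names:
        # Assumption: the parent directory name is the class label.
        by_class[os.path.basename(os.path.dirname(name))].append(name)
    rng = random.Random(seed)  # private generator; global random state untouched
    chosen = []
    for label in sorted(by_class):
        docs = by_class[label]
        chosen.extend(rng.sample(docs, min(per_class, len(docs))))
    return chosen


if __name__ == "__main__":
    names = [f"~/20_newsgroups/{label}/{i}"
             for label in ("talk.politics.guns", "talk.politics.misc")
             for i in range(10)]
    print(per_class_sample(names, 4, seed=1) == per_class_sample(names, 4, seed=1))
```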