MALLET is an integrated collection of Java code useful for statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text.
It was written by Andrew McCallum, with contributions from several graduate students and staff at the University of Massachusetts Amherst, as well as contributions from Fernando Pereira, Ryan McDonald, and others.
Its development was funded by DARPA and the Air Force Research Laboratory (AFRL) under contract numbers F30602-00-2-0597 and F30602-01-2-0566, and by the National Science Foundation under grant EIA-9983215.
You might also be interested in other similar software packages for machine learning applied to text.
Although many portions of the toolkit are very stable and usable, the toolkit as a whole is still at quite an early stage of development. There is unfortunately almost no documentation at this point; it will come eventually. In the meantime, there are good examples of code for front-end functionality in directories named "examples", and in source files named "TUI.java".
The toolkit is Open Source Software, and is released under the Common Public License.
McCallum, Andrew Kachites. "MALLET: A Machine Learning for Language Toolkit." http://www.cs.umass.edu/~mccallum/mallet. 2002.
Here is a BibTeX entry:
@unpublished{McCallumMALLET,
  author = "Andrew Kachites McCallum",
  title = "MALLET: A Machine Learning for Language Toolkit",
  note = "http://www.cs.umass.edu/~mccallum/mallet",
  year = 2002
}
Source code for the library can be downloaded from this directory (sorry, first release not yet available, but will be soon). Different versions are identified by eight-digit sequences giving the year, month, and day. Thus, the most recent version is the one with the largest version number.
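As a rough illustration of the versioning scheme, the sketch below picks the most recent release from a list of names by comparing their eight-digit date stamps. The class name, method names, and the archive file names are all hypothetical, since no release has been published yet; only the year-month-day convention comes from the description above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of MALLET.
class LatestVersion {
    // Matches a standalone run of exactly eight digits (YYYYMMDD).
    private static final Pattern STAMP = Pattern.compile("(?<!\\d)(\\d{8})(?!\\d)");

    // Returns the eight-digit stamp found in a name, or null if there is none.
    static String stamp(String name) {
        Matcher m = STAMP.matcher(name);
        return m.find() ? m.group(1) : null;
    }

    // Returns the name carrying the largest stamp, or null if no name has one.
    // Stamps have a fixed width, so string order equals numeric (and date) order.
    static String latest(List<String> names) {
        String best = null;
        String bestStamp = null;
        for (String name : names) {
            String s = stamp(name);
            if (s != null && (bestStamp == null || s.compareTo(bestStamp) > 0)) {
                best = name;
                bestStamp = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed file names, for illustration only.
        List<String> names = Arrays.asList(
            "mallet-20020115.tar.gz",
            "mallet-20021103.tar.gz",
            "mallet-20020930.tar.gz");
        System.out.println(latest(names));
    }
}
```

Because the stamp puts the year first, then the month, then the day, a plain comparison of the digit strings is enough; no date parsing is needed.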
MALLET requires Java 1.4 or higher.
Unfortunately, I do not have time to help all users with their compilation and usage problems. Feel free to send me mail asking for help, but please do not expect that I will always have time to respond. Most appreciated are bug reports accompanied by fixes.