Andrew McCallum


SRAA: Simulated/Real/Aviation/Auto UseNet data [document classification]
73,218 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation. I have often used this data for binary classification (separating real from simulated, and auto from aviation), making the point that the same data can be classified in different ways depending on the user's needs. This is especially interesting for semi-supervised learning. This data was gathered by Andrew McCallum while at Just Research.
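
As a rough illustration of how one four-group collection yields two different binary tasks, here is a minimal Python sketch. The group keys (`sim_auto`, `real_aviation`, and so on), the task names, and the `binary_label` helper are hypothetical names chosen for this example; they are not the identifiers used in the distributed data.

```python
# Hypothetical group keys; the actual newsgroup names in the data may differ.
GROUPS = {
    "sim_auto": ("simulated", "auto"),
    "sim_aviation": ("simulated", "aviation"),
    "real_auto": ("real", "auto"),
    "real_aviation": ("real", "aviation"),
}


def binary_label(group, task):
    """Map one of the four groups to its label under the chosen binary task."""
    realness, topic = GROUPS[group]
    if task == "real_vs_simulated":
        return realness
    if task == "auto_vs_aviation":
        return topic
    raise ValueError(f"unknown task: {task}")


# The same article receives a different label depending on the task.
for group in GROUPS:
    print(group,
          binary_label(group, "real_vs_simulated"),
          binary_label(group, "auto_vs_aviation"))
```

Under either task, two of the four groups fall on each side of the split, so an article's label depends on the task chosen and not only on its source group.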

Cora Citation Matching [reference matching, object correspondence]
Text of citations hand-clustered into groups referring to the same paper.

Cora Research Paper Classification [relational document classification]
Research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set because the citations provide relations among papers.

Cora Information Extraction [information extraction]
Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.

Frequently Asked Questions [information extraction]
Several UseNet FAQs segmented into questions and answers. Data gathered and labeled by Dayne Freitag and Andrew McCallum.

CMU Seminar Announcements [information extraction]
48 emailed seminar announcements, with labeled segments for speaker, title, start-time, end-time. Labeled by Dayne Freitag.

Industry Sector [document classification]
Corporate web pages classified into a topic hierarchy with about 70 leaves.

20 Newsgroups [document classification]
About 20,000 UseNet postings from 20 newsgroups. Gathered by Ken Lang at CMU in the mid-1990s. This is the original set, without the various edits made by Jason Rennie and others.