Data
SRAA: Simulated/Real/Aviation/Auto UseNet data [document classification]
73,218 UseNet articles from four discussion groups, covering simulated
auto racing, simulated aviation, real autos, and real aviation. I have
often used this data for binary classification (separating real from
simulated, or auto from aviation) to make the point that the same data
can be classified in different ways depending on the user's needs. This
is especially interesting for semi-supervised learning. This data was
gathered by Andrew McCallum while at Just Research.
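
The two binary tasks described above can be derived from the same four
groups by relabeling. The following is only a minimal sketch: the group
labels (sim_auto, sim_aviation, real_auto, real_aviation) and the
function names are hypothetical, chosen for illustration, and are not
the names used in the distributed data.

```python
# Hypothetical labels for the four discussion groups; the names used in
# the actual data files may differ.
GROUPS = ("sim_auto", "sim_aviation", "real_auto", "real_aviation")


def real_vs_simulated(group: str) -> str:
    """Binary task 1: collapse the four groups into real vs. simulated."""
    return "simulated" if group.startswith("sim_") else "real"


def auto_vs_aviation(group: str) -> str:
    """Binary task 2: collapse the same four groups into auto vs. aviation."""
    return "auto" if group.endswith("_auto") else "aviation"


def relabel(groups, task):
    """Apply one of the binary labelings to a sequence of group labels."""
    return [task(g) for g in groups]


if __name__ == "__main__":
    for g in GROUPS:
        print(g, real_vs_simulated(g), auto_vs_aviation(g))
```

The documents stay the same under both labelings; only the target
changes, which is what makes the set convenient for comparing
classifiers across different user needs.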
Cora Citation Matching [reference matching, object correspondence]
Text of citations hand-clustered into groups referring to the same paper.
Cora Research Paper Classification [relational document classification]
Research papers classified into a topic hierarchy with 73 leaves. We
call this a relational data set, because the citations provide
relations among papers.
Cora Information Extraction [information extraction]
Research paper headers and citations, with labeled segments for
authors, title, institutions, venue, date, page numbers and several
other fields.
Frequently Asked Questions [information extraction]
Several UseNet FAQs segmented into questions and answers. Data
gathered and labeled by Dayne Freitag and Andrew McCallum.
CMU Seminar Announcements [information extraction]
48 emailed seminar announcements, with labeled segments for speaker,
title, start-time, and end-time. Labeled by Dayne Freitag.
Industry Sector [document classification]
Corporate web pages classified into a topic hierarchy with about 70 leaves.
20 Newsgroups [document classification]
About 20,000 UseNet postings from 20 newsgroups. Gathered by Ken Lang
at CMU in the mid-1990s. This is the original set, without the various
edits made by Jason Rennie and others.