The Enron Email Dataset is distributed
by William Cohen. The dataset
consists of 517,431 messages that belong to 150 users, mostly senior management
of the Enron Corp. Although the dataset is
huge, the topical folders of individual users are often quite sparse. We use the email
directories of seven users, which are especially large. The users are: Sally
Beck (Chief Operating Officer), Darren Farmer (Logistics Manager), Vincent
Kaminski (Head of Quantitative Modeling Group), Louise Kitchen (President of
EnronOnline), Michelle Lokay (Administrative Assistant), Richard Sanders
(Assistant General Counsel) and William Williams III (Senior Analyst).
Preprocessing. We
remove non-topical folders: all_documents, calendar, contacts,
deleted_items, discussion_threads, inbox, notes_inbox, sent, sent_items and
_sent_mail. We then flatten all the folder hierarchies. After that we
remove all folders that contain fewer than three messages. We also remove the X-folder
field from the message headers, because it contains the class label. We do not
remove the rest of the message headers. See the table below for statistics on the
seven preprocessed datasets:
| User | Number of folders | Number of messages | Messages in smallest folder | Messages in largest folder |
|---|---|---|---|---|
| beck-s | 101 | 1971 | 3 | 166 |
| farmer-d | 25 | 3672 | 5 | 1192 |
| kaminski-v | 41 | 4477 | 3 | 547 |
| kitchen-l | 47 | 4015 | 5 | 715 |
| lokay-m | 11 | 2489 | 6 | 1159 |
| sanders-r | 30 | 1188 | 4 | 420 |
| williams-w3 | 18 | 2769 | 3 | 1398 |
Download seven
preprocessed datasets (14.7 Mb tarred, gzipped).
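The preprocessing steps described above can be sketched in Python as follows. This is a minimal illustration, not the code used to build the released datasets. The function name `preprocess`, the in-memory input format (a dict from folder path to raw message strings), the choice to match non-topical folders by their top-level name, and the underscore-joined flattened folder names are all assumptions made for the example.

```python
from collections import defaultdict
from email import message_from_string

# Non-topical folders listed in the text.
NON_TOPICAL = {
    "all_documents", "calendar", "contacts", "deleted_items",
    "discussion_threads", "inbox", "notes_inbox", "sent",
    "sent_items", "_sent_mail",
}
MIN_MESSAGES = 3  # folders with fewer messages are dropped


def preprocess(folders):
    """Return {flat_folder_name: [cleaned raw messages]}.

    `folders` maps a folder path such as "projects/gas" to a list of
    raw message strings (hypothetical input format).
    """
    flat = defaultdict(list)
    for path, messages in folders.items():
        parts = path.strip("/").split("/")
        # Assumption: a folder is non-topical if its top-level name is listed.
        if parts[0].lower() in NON_TOPICAL:
            continue
        # Assumption: flatten a hierarchy by joining path components.
        flat_name = "_".join(parts)
        for raw in messages:
            msg = message_from_string(raw)
            # Drop the header that leaks the class label; the other
            # headers are kept. Deleting a missing header is a no-op.
            del msg["X-Folder"]
            flat[flat_name].append(msg.as_string())
    # Remove folders that contain fewer than three messages.
    return {name: msgs for name, msgs in flat.items()
            if len(msgs) >= MIN_MESSAGES}
```

The small-folder filter runs after flattening, matching the order given in the text.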
Experimental setup. We apply four classifiers (MaxEnt, Naive Bayes, SVM
and Winnow). We use the Mallet implementations
of MaxEnt, Naive Bayes and Winnow (Avrim
Blum’s version), and the SVMlight
implementation of Support Vector Machines. Since email is heavily
time-dependent, we cannot use standard random splits for training and test
sets. We sort all messages according to the Date field and apply
incremental timeline splits: we initially train on the first 100 messages and
test on the following 100 messages, then we train on the first 200 messages and
test on the following 100 messages, and so on.
Download dataset
timelines (130 Kb tarred, gzipped).
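The incremental timeline splits can be illustrated with a short Python sketch. The function name `timeline_splits`, the `date` key on each message, and the decision to keep a final test block shorter than 100 messages are assumptions made for this example; the text does not say how a trailing partial block is handled.

```python
def timeline_splits(messages, step=100):
    """Yield (train, test) pairs over messages ordered by date.

    Each message is assumed to be a dict with a comparable "date" value
    (for example a datetime parsed from the Date header).
    """
    ordered = sorted(messages, key=lambda m: m["date"])
    # Train on the first `end` messages, test on the next `step` messages.
    for end in range(step, len(ordered), step):
        train = ordered[:end]
        test = ordered[end:end + step]  # the last block may be shorter
        yield train, test
```

With 350 messages this yields three splits: training sets of 100, 200 and 300 messages, with test sets of 100, 100 and 50 messages.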
Classification results. We report accuracy over the timeline train/test
splits. Click here to see accuracy/timeline
graphs of the seven datasets. As the graphs show, MaxEnt, SVM and
Winnow give similar results, while the results of Naive Bayes are significantly
worse. Overall, the results are surprisingly low, probably because
we apply no feature selection. On one dataset (williams-w3) the results
are extremely high, but as the table above shows, about half of
the dataset belongs to a single category, so it is probably not an interesting
dataset.
We also plot accuracy against the percentage of test set coverage. For each split,
after performing the actual classification, we sort all the test messages
according to their classification score and threshold the sorted list so that the first
10%, 20%, ..., 100% of messages are chosen. Then we calculate accuracy at each
of the 10 thresholds. After that, we average the accuracies at each threshold over
all the train/test splits. We report the mean accuracy and the standard error of the
mean at each threshold. Click here
to see accuracy/coverage graphs of the seven datasets.
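The accuracy/coverage computation can be sketched as below. This is an illustration under stated assumptions, not the original evaluation code: the function names are hypothetical, test messages are assumed to be ranked from the highest classification score to the lowest, each cutoff is rounded down to a whole number of messages, and the standard error uses the sample standard deviation.

```python
import math


def accuracy_at_coverage(scores, correct, n_thresholds=10):
    """Accuracy over the top 10%, 20%, ..., 100% of one test set.

    `scores` are classification scores; `correct` holds 1 for a correct
    prediction and 0 otherwise, in the same order as `scores`.
    """
    # Assumption: a higher score means a more confident prediction.
    ranked = [c for _, c in sorted(zip(scores, correct),
                                   key=lambda pair: pair[0], reverse=True)]
    accuracies = []
    for k in range(1, n_thresholds + 1):
        cutoff = max(1, (len(ranked) * k) // n_thresholds)
        accuracies.append(sum(ranked[:cutoff]) / cutoff)
    return accuracies


def mean_and_sem(per_split):
    """Mean and standard error of the mean at each threshold.

    `per_split` is a list of accuracy lists, one per train/test split.
    """
    summary = []
    for values in zip(*per_split):
        n = len(values)
        mean = sum(values) / n
        if n > 1:
            variance = sum((v - mean) ** 2 for v in values) / (n - 1)
            sem = math.sqrt(variance / n)
        else:
            sem = 0.0  # undefined for a single split; reported as zero here
        summary.append((mean, sem))
    return summary
```

Calling `accuracy_at_coverage` once per split and passing the collected lists to `mean_and_sem` gives one (mean, standard error) pair per coverage threshold.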
Download all the
graphs in EPS, FIG, JPG and PDF formats (1.4 Mb tarred, gzipped).
Publication. We
present an extensive case study of email foldering (including a proposed
evaluation method, a discussion of various design choices, the application of the
four classifiers and a comparative analysis of their results) in the following
paper:
Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora. Joint work with A. McCallum and G. Huang. CIIR Technical Report IR-418, 2004 (ps, pdf).