Machine Learning and Friends Lunch

Interactive Feature Selection in Active Learning for Text Classification


Hema Raghavan
UMass

Abstract


Many real-world text classification problems come with large amounts of unlabeled data, and asking a human to label this data is time-consuming and expensive. The pool-based query paradigm (Cohn, Atlas and Ladner, 1990), also called the active learning paradigm, gives the learner access to a large pool of unlabeled data, and the learner can choose whether or not to query an expert for the label of an instance. Much work has been done on active learning (e.g., Freund et al., 1997; Lewis, 1990), and several selection techniques have been studied.
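As a concrete illustration of the paradigm, the following sketch runs pool-based active learning with uncertainty sampling; the synthetic data, the linear SVM, and the closest-to-the-boundary selection rule are assumptions for illustration, not details taken from the talk.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    # A toy pool of 1000 instances whose labels are hidden from the learner.
    X_pool = rng.normal(size=(1000, 50))
    y_hidden = (X_pool @ rng.normal(size=50) > 0).astype(int)

    # Seed set: one labeled example from each class.
    labeled = [int(np.flatnonzero(y_hidden == 0)[0]),
               int(np.flatnonzero(y_hidden == 1)[0])]
    unlabeled = sorted(set(range(len(X_pool))) - set(labeled))

    for _ in range(20):  # 20 rounds of querying the expert
        clf = LinearSVC().fit(X_pool[labeled], y_hidden[labeled])
        # Uncertainty sampling: query the pool instance closest to the
        # current decision boundary.
        margins = np.abs(clf.decision_function(X_pool[unlabeled]))
        query = unlabeled[int(np.argmin(margins))]
        labeled.append(query)      # the "expert" reveals y_hidden[query]
        unlabeled.remove(query)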

Feature selection can serve several purposes, including improving classifier performance (accuracy/error) and the space and time efficiency of the classifier. In the standard inductive learning paradigm, feature selection has had mixed success at improving performance, depending on the classifier being used and on the domain.

In this work we carefully study the effects of feature selection on classifier performance as the training set is incrementally increased, and in particular in active learning scenarios. In a number of experiments using a "Feature Oracle", we observe that feature selection can significantly improve classifier accuracy when we have few labeled examples.
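The talk does not spell out how the Feature Oracle is constructed; the sketch below simply assumes an oracle that exposes the truly relevant features of a synthetic problem, and compares learning curves with and without that knowledge as the training set grows.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)

    # Synthetic data: only the first 10 of 200 features carry signal.
    X = rng.normal(size=(2000, 200))
    relevant = np.arange(10)              # what the oracle "knows"
    y = (X[:, relevant].sum(axis=1) > 0).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

    for n in (20, 50, 100, 500):          # incrementally larger training sets
        full = LinearSVC().fit(X_tr[:n], y_tr[:n]).score(X_te, y_te)
        oracle = LinearSVC().fit(X_tr[:n, relevant], y_tr[:n]).score(
            X_te[:, relevant], y_te)
        print(f"n={n:4d}  all features: {full:.2f}  oracle features: {oracle:.2f}")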

We find that feature selection not only improves classifier accuracy, but also helps example selection in active learning. We also find that filter-based feature selection using the information gain of the features on the training set has very limited benefit for SVMs, but that extra information about the features, in our case in the form of a human's prior knowledge, can accelerate the learning of the concept classes.
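For reference, here is one minimal way to implement such an information-gain filter (our own sketch of the standard computation, not the authors' code): score each binary feature x by H(y) - H(y|x) on the training set and keep the top k.

    import numpy as np

    def information_gain(x, y):
        """IG of a binary feature x for binary labels y: H(y) - H(y|x)."""
        def entropy(p):
            p = p[p > 0]
            return float(-np.sum(p * np.log2(p)))
        h_y = entropy(np.bincount(y) / len(y))
        h_y_given_x = 0.0
        for v in (0, 1):                  # condition on feature absent/present
            mask = (x == v)
            if mask.any():
                h_y_given_x += mask.mean() * entropy(
                    np.bincount(y[mask]) / mask.sum())
        return h_y - h_y_given_x

    def select_top_k(X, y, k):
        """Rank columns of a binary term-document matrix X by IG; keep top k."""
        scores = [information_gain(X[:, j], y) for j in range(X.shape[1])]
        return np.argsort(scores)[::-1][:k]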

This is joint work with Omid Madani and Rosie Jones at Yahoo! Research Labs.
