UMass Machine Learning and Friends Lunch | One-class Clustering For Web Mining And Information Retrieval

Many text datasets consist of a coherent subset of documents along with unconsolidated noise. In Web information retrieval, for example, a ranked list often contains a subset of relevant documents that are topically related to each other, while the remaining documents can be on any topic and contain the query terms only by accident. Our objective is to identify this coherent subset ("the core"). Assuming that the core documents share a relatively small group of words, which presumably describe the core topic, we observe that the problem has a dual formulation: identify the core documents directly, or identify the topical words, which in turn lead to the core documents. First, we will show analytically that, under certain generative assumptions, topical words can be successfully identified even in collections of moderate size. We will then formulate the learning problem as one-class clustering (OCC) that optimizes an information-theoretic objective function, and propose a simple OCC algorithm that is optimal for text collections under the imposed assumptions. Next, we will relax the generative assumptions and propose two more advanced OCC methods, one based on co-clustering and one based on Bayesian inference. Finally, we will demonstrate the performance of our methods on two applications: Web appearance disambiguation and reranking of Web retrieval results.