Machine Learning and Friends Lunch

Minimal Test Collections for Retrieval Evaluation


Ben Carterette
UMass Amherst

Abstract

Accurate estimation of information retrieval evaluation metrics such as average precision requires large sets of relevance judgments. Building sets large enough to evaluate real-world retrieval systems is at best inefficient, at worst infeasible. In this work we link evaluation with test collection construction to understand the minimal judging effort needed to have high confidence in the outcome of an evaluation. A new way of looking at average precision leads to a natural algorithm for selecting which documents to judge, and allows us to estimate the confidence in an evaluation by defining a distribution over possible document judgments. A study with annotators shows that a small group of researchers can use this method to rank a set of systems in under three hours with 95% confidence.
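
To make the judging loop concrete, the minimal Python sketch below illustrates the general idea rather than the speaker's actual algorithm: unjudged documents are modeled as Bernoulli(0.5) relevance variables, the confidence that one ranked list has higher average precision than another is estimated by Monte Carlo, and a greedy criterion picks the next document to send to an annotator until the 95% threshold mentioned above is reached. All function names, the uniform prior, and the brute-force sampling are illustrative assumptions, not details from the talk.

# Illustrative sketch only -- not the speaker's algorithm. Unjudged documents
# are treated as Bernoulli(0.5) relevance variables, and confidence in the
# sign of the average-precision difference is estimated by Monte Carlo.
import random


def average_precision(ranking, relevance):
    # AP of a ranked list given a dict doc -> 0/1 with every document judged.
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if relevance[doc]:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0


def confidence_a_beats_b(run_a, run_b, judged, unjudged, p=0.5, samples=2000):
    # Monte Carlo estimate of P(AP(run_a) > AP(run_b)) under the prior.
    wins = 0
    for _ in range(samples):
        sample = dict(judged)
        for doc in unjudged:
            sample[doc] = 1 if random.random() < p else 0
        if average_precision(run_a, sample) > average_precision(run_b, sample):
            wins += 1
    return wins / samples


def select_next_doc(run_a, run_b, judged, unjudged):
    # Greedy, illustrative criterion: pick the unjudged document whose
    # judgment is expected to move the AP difference the most.
    def expected_impact(doc):
        total = 0.0
        for value in (0, 1):
            sample = dict(judged)
            sample[doc] = value
            for other in unjudged:
                if other != doc:
                    sample[other] = random.randint(0, 1)
            total += abs(average_precision(run_a, sample)
                         - average_precision(run_b, sample))
        return total / 2
    return max(unjudged, key=expected_impact)


def judge_until_confident(run_a, run_b, true_relevance, target=0.95):
    # Judge one document at a time until the sign of the AP difference is
    # resolved with the target confidence; return the judgments collected.
    judged = {}
    unjudged = set(run_a) | set(run_b)
    conf = confidence_a_beats_b(run_a, run_b, judged, unjudged)
    while unjudged and (1 - target) < conf < target:
        doc = select_next_doc(run_a, run_b, judged, unjudged)
        judged[doc] = true_relevance[doc]   # the call to the annotator
        unjudged.remove(doc)
        conf = confidence_a_beats_b(run_a, run_b, judged, unjudged)
    return judged, conf


if __name__ == "__main__":
    docs = ["d%d" % i for i in range(10)]
    truth = {d: random.randint(0, 1) for d in docs}
    run_a, run_b = docs, list(reversed(docs))
    judged, conf = judge_until_confident(run_a, run_b, truth)
    print("judged %d of %d documents, P(A beats B) = %.2f"
          % (len(judged), len(docs), conf))

The talk's approach reaches the stopping decision from the distribution over judgments rather than by brute-force sampling; the sketch only shows how document selection, judging, and a confidence estimate fit into one loop.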
