Machine Learning and Friends Lunch

A Hierarchical, HMM-based Automatic Evaluation of OCR Accuracy for a Digital Library of Books

Abstract

Content-based online book retrieval usually requires first converting printed text into machine-readable text with an OCR engine and then running full-text search on the results. Many of these books are old, and building an end-to-end system requires a variety of processing steps. Changing any step can affect OCR performance, so a good automatic statistical evaluation of OCR performance on book-length material is needed. Evaluating OCR performance on an entire book is non-trivial: the only easily obtainable ground truth must be automatically aligned with the OCR output over the full length of the book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors and by the possibility that large chunks of material are missing from one of the sequences. I will describe a hierarchical alignment algorithm, based on a Hidden Markov Model (HMM), that aligns OCR output with the ground truth for books. The alignment process works by breaking the problem of aligning two long sequences into the problem of aligning many smaller subsequences. Joint work with R. Manmatha while visiting Google.
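
The divide-and-align idea can be illustrated with a minimal sketch. This is not the HMM-based algorithm from the talk: as a stand-in, it assumes that words occurring exactly once in both sequences can serve as anchors, and it uses Python's difflib.SequenceMatcher to align the short segments between anchors. The function names (unique_word_anchors, count_matched_words, word_accuracy) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def unique_word_anchors(truth, ocr):
    """Pair up words that occur exactly once in both sequences.

    Assumption for this sketch: such words are reliable anchors.
    Only pairs that keep the OCR positions increasing are retained.
    """
    truth_counts, ocr_counts = Counter(truth), Counter(ocr)
    ocr_pos = {w: j for j, w in enumerate(ocr) if ocr_counts[w] == 1}
    candidates = [(i, ocr_pos[w]) for i, w in enumerate(truth)
                  if truth_counts[w] == 1 and w in ocr_pos]
    # Greedy monotone filter (a simplification; a longest increasing
    # subsequence would be more robust to spurious anchors).
    anchors, last_j = [], -1
    for i, j in candidates:
        if j > last_j:
            anchors.append((i, j))
            last_j = j
    return anchors


def count_matched_words(truth, ocr):
    """Split both sequences at the anchors, then align each short
    segment pair independently and count the matching words."""
    anchors = unique_word_anchors(truth, ocr)
    bounds = [(-1, -1)] + anchors + [(len(truth), len(ocr))]
    matched = len(anchors)  # every anchor is itself a matched word
    for (i0, j0), (i1, j1) in zip(bounds, bounds[1:]):
        seg_truth = truth[i0 + 1:i1]
        seg_ocr = ocr[j0 + 1:j1]
        matcher = SequenceMatcher(None, seg_truth, seg_ocr, autojunk=False)
        matched += sum(block.size for block in matcher.get_matching_blocks())
    return matched


def word_accuracy(truth, ocr):
    """Fraction of ground-truth words recovered in the OCR output."""
    return count_matched_words(truth, ocr) / len(truth) if truth else 1.0


truth_words = "the quick brown fox jumps over the lazy dog".split()
ocr_words = "the quick brovvn fox jumps over the lazy dog".split()
print(word_accuracy(truth_words, ocr_words))
```

Because each segment between anchors is aligned on its own, the cost of the fine-grained alignment depends on segment length rather than on the length of the whole book, and a large missing chunk only affects the segment in which it falls.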
