Machine Learning and Friends Lunch

A Hierarchical, HMM-based Automatic Evaluation of OCR Accuracy for a Digital Library of Books
Shaolei Feng, UMass

Abstract
Content-based online book retrieval usually requires first converting printed
text into machine-readable text with an OCR engine and then running full-text
search on the results. Many of these books are old, and a variety of processing
steps are required to create an end-to-end system. Changing any step can affect
OCR performance, so a good automatic statistical evaluation of OCR performance
on book-length material is needed. Evaluating OCR performance on an entire book
is non-trivial. The only easily obtainable ground truth must be automatically
aligned with the OCR output over the entire length of a book. This may be viewed
as equivalent to the problem of aligning two large sequences (easily a million
long). The problem is further complicated by OCR errors as well as the
possibility of large chunks of missing material in one of the sequences. I will
describe a hierarchical alignment algorithm based on Hidden Markov Models (HMMs)
that aligns OCR output with the ground truth for books. The alignment process
works by breaking up the problem of aligning two long sequences into the problem
of aligning many smaller subsequences.

Joint work with R. Manmatha while visiting Google.
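
The divide-and-align idea in the abstract can be illustrated with a small sketch. This is not the HMM-based algorithm from the talk: it swaps in Python's standard difflib.SequenceMatcher as the aligner, and it assumes, purely for illustration, that words occurring exactly once in both sequences serve as the coarse anchors that split the book into smaller pieces. The function names (find_anchors, count_matches, hierarchical_word_accuracy) are hypothetical.

```python
from difflib import SequenceMatcher


def unique_positions(words):
    """Map each word that occurs exactly once to its index."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return {w: i for i, w in enumerate(words) if counts[w] == 1}


def find_anchors(ocr, truth):
    """Coarse level: pair words that are unique in both sequences.

    Assumption for this sketch: such words are reliable anchors. The
    matching blocks keep the anchor pairs in increasing order on both sides.
    """
    pos_ocr, pos_truth = unique_positions(ocr), unique_positions(truth)
    common = pos_ocr.keys() & pos_truth.keys()
    a = [w for w in ocr if w in common]
    b = [w for w in truth if w in common]
    sm = SequenceMatcher(None, a, b, autojunk=False)
    anchors = []
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            word = a[block.a + k]
            anchors.append((pos_ocr[word], pos_truth[word]))
    return anchors


def count_matches(ocr_seg, truth_seg):
    """Fine level: count words matched inside one short segment pair."""
    sm = SequenceMatcher(None, ocr_seg, truth_seg, autojunk=False)
    return sum(block.size for block in sm.get_matching_blocks())


def hierarchical_word_accuracy(ocr, truth):
    """Fraction of ground-truth words matched by the two-level alignment."""
    if not truth:
        return 1.0
    anchors = find_anchors(ocr, truth)
    # Sentinels so the stretches before the first and after the last anchor
    # are aligned as well.
    bounds = [(-1, -1)] + anchors + [(len(ocr), len(truth))]
    matched = len(anchors)  # every anchor is itself a matched word
    for (i0, j0), (i1, j1) in zip(bounds, bounds[1:]):
        matched += count_matches(ocr[i0 + 1:i1], truth[j0 + 1:j1])
    return matched / len(truth)


if __name__ == "__main__":
    truth = "the quick brown fox jumps over the lazy dog".split()
    ocr = "the qu1ck brown fox jumps ovcr the lazy dog".split()
    print(hierarchical_word_accuracy(ocr, truth))
```

Because each segment between neighboring anchors is aligned on its own, a large chunk missing from one sequence only affects the segment that contains it, which is the practical appeal of splitting one long alignment into many short ones.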