UMass Machine Learning and Friends Lunch | Mining a Million Books: Partial Duplicate Detection, Translation Identification, and OCR Evaluation

Abstract: Large scanned book collections (such as the Internet Archive and Google Books) lead to interesting new research problems. One can look at how individual books link to each other by finding overlapping or translated content. One can also look at the problem of large-scale OCR evaluation. Here we adopt an alignment approach to solving these problems. The main challenge is that scanned books are typically very long, noisy texts (because of OCR errors) containing hundreds of thousands of words. Standard sequence alignment algorithms do not scale to texts of this length. We therefore propose a compact representation for long noisy texts that enables fast and accurate sequence alignment and analysis. The idea is to represent each book by the sequence of words that appear only once in it (referred to as "unique" words). Together with their order, unique words are highly descriptive of the content and the flow of ideas in the book. We show that the proposed representation produces accurate results and provides dramatic speed-ups for partial duplicate detection and translation detection. We also propose a REcursive Text Alignment Scheme (RETAS), which uses the sequence of unique words to guide the alignment effectively, providing dramatic time savings. This technique is then used for automatic OCR evaluation of books. This is joint work with R. Manmatha and Ethem Can.
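To make the unique-word idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names (`unique_words`, `anchor_pairs`, `recursive_align`) are hypothetical, the standard-library `difflib.SequenceMatcher` stands in for whatever sequence alignment the actual system uses, and the recursion is only one plausible reading of "recursive" alignment guided by unique words, assuming that the gaps between matched unique words are re-aligned using words that are unique within each gap.

```python
from collections import Counter
from difflib import SequenceMatcher


def unique_words(words):
    """Return (position, word) pairs for words occurring exactly once, in order."""
    counts = Counter(words)
    return [(i, w) for i, w in enumerate(words) if counts[w] == 1]


def anchor_pairs(words_a, words_b):
    """Align the two unique-word sequences; return matched (pos_a, pos_b) pairs."""
    ua, ub = unique_words(words_a), unique_words(words_b)
    # SequenceMatcher is a stand-in aligner (an assumption, not the talk's method).
    matcher = SequenceMatcher(None, [w for _, w in ua], [w for _, w in ub],
                              autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((ua[block.a + k][0], ub[block.b + k][0]))
    return pairs


def recursive_align(words_a, words_b, off_a=0, off_b=0):
    """Anchor on unique words, then repeat inside each gap between anchors."""
    anchors = anchor_pairs(words_a, words_b)
    if not anchors:
        return []
    aligned = []
    prev_a = prev_b = 0
    # A sentinel pair past the end makes the loop also process the trailing gap.
    for i, j in anchors + [(len(words_a), len(words_b))]:
        # Words that repeat globally may be unique inside this smaller gap.
        aligned += recursive_align(words_a[prev_a:i], words_b[prev_b:j],
                                   off_a + prev_a, off_b + prev_b)
        if i < len(words_a):  # skip the sentinel itself
            aligned.append((off_a + i, off_b + j))
        prev_a, prev_b = i + 1, j + 1
    return aligned
```

For example, aligning the word lists of "the cat saw the dog" and an OCR-corrupted "the cat sav the dog" with `recursive_align` pairs every position except the corrupted word, which is the kind of mismatch an OCR evaluation would count as an error. Only the short unique-word sequences are aligned at each level, which is where the speed-up in this sketch comes from.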

Bio: Ismet Zeki Yalniz is a Ph.D. candidate in the Department of Computer Science, University of Massachusetts at Amherst, MA, USA. He earned his M.S. and B.S. degrees in computer engineering from Bilkent University, Ankara, Turkey, in 2008 and 2006, respectively. He is broadly interested in combining computer vision and information retrieval concepts to offer practical solutions for data- and/or computation-intensive problems. On the computer vision side, he has worked on texture analysis and segmentation, video event detection, and document image analysis and recognition. On the information retrieval side, he has worked on the retrieval of noisy (OCRed) text documents, focusing on OCR evaluation and error correction, duplicate document detection, and translation identification. His current research focus is on the development of effective and fast image search techniques.