Statistical learning for OCR error correction
Authors: Jie Mei, Aminul Islam, Abidalrahman Moh’d, Yajing Wu, Evangelos Milios
Institution: 1. Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 1W5, Canada; 2. School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA 70503, USA
Abstract: Modern OCR engines incorporate some form of error correction, typically based on dictionaries. However, residual errors remain, and they degrade the performance of natural language processing algorithms applied to OCR text. In this paper, we present a statistical learning model for post-processing OCR errors, either fully automatically or with minimal user interaction to further reduce the error rate. Our model employs web-scale corpora and integrates a rich set of linguistic features. Through an interdependent learning pipeline, it produces and continuously refines both the error detection and the suggested candidate corrections. Evaluated on a historical biology book with complex error patterns, our model outperforms various baseline methods in the automatic mode and shows an even greater advantage when minimal user interaction is involved. Quantitative analysis of each computational step further suggests that the proposed model is well suited to handling volatile and complex OCR error patterns, which are beyond the capabilities of the error correction built into OCR engines.
Keywords: OCR post-processing; OCR error; Error correction; Statistical learning
This article is indexed in ScienceDirect and other databases.
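
For readers unfamiliar with the setting, the dictionary-based correction that the abstract describes as typical of OCR engines can be sketched as follows. This is a minimal illustration, not the authors' model: the toy `UNIGRAM_COUNTS` lexicon stands in for a web-scale corpus, and the function names, the edit-distance threshold, and the ranking rule are all assumptions made for the example. The paper's approach replaces this kind of fixed ranking with a learned model over a richer feature set.

```python
from collections import Counter

# Hypothetical toy lexicon with counts; a stand-in for web-scale corpus statistics.
UNIGRAM_COUNTS = Counter(
    {"the": 500, "species": 40, "leaves": 30, "flower": 25, "flowers": 20}
)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def detect_errors(tokens):
    """Dictionary-based detection: flag positions whose token is not in the lexicon."""
    return [i for i, t in enumerate(tokens) if t.lower() not in UNIGRAM_COUNTS]


def suggest(token, max_distance=2, top_k=3):
    """Return lexicon words within max_distance, closest first, then most frequent."""
    scored = []
    for word, count in UNIGRAM_COUNTS.items():
        d = edit_distance(token.lower(), word)
        if d <= max_distance:
            scored.append((d, -count, word))
    return [word for _, _, word in sorted(scored)[:top_k]]


def correct(tokens):
    """Fully automatic mode: replace each flagged token with its top suggestion."""
    out = list(tokens)
    for i in detect_errors(tokens):
        candidates = suggest(tokens[i])
        if candidates:
            out[i] = candidates[0]
    return out
```

Calling `correct(["the", "specics", "flovver"])` under these assumptions replaces the two out-of-lexicon tokens with their nearest lexicon entries. A semi-automatic mode like the one the abstract mentions could instead show the list returned by `suggest` to a user and let them choose.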