
Automatic Text Correction for Devanagari OCR

Affiliations

  • Department of Computer Science, Punjabi University, Patiala – 147002, Punjab, India

Abstract


Objectives: This paper proposes a new technique, based on a confusion matrix, for correcting errors made by a Devanagari OCR (Optical Character Reader) system. Methods/Statistical Analysis: The confusion matrix is generated from a large Hindi corpus. The system takes each word of the OCR output and generates a number of candidate strings from the five most frequently confused characters for each character of the input word, along with the probability of each string for ranking. Each string is validated against a character trigram dictionary, and the valid strings are used to produce the best suggestions. Findings: The top five words are taken as suggestions. The system has been tested on a variety of Devanagari OCR output documents. The system provides suggestions for all the correct words at the top position. For more than 10,000 unique words in Devanagari OCR output, the system achieves an accuracy of 97%. Application/Improvements: The system is used in post-processing of Devanagari OCR. With some improvements, it can also be used for the Gurumukhi and Urdu scripts.
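
The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the confusion table, its probabilities, the small lexicon, the `^`/`$` word-boundary padding, and the names `trigrams` and `suggest` are all assumptions made for illustration, and Latin letters stand in for Devanagari characters to keep the example readable. The candidate score is assumed to be the product of the per-character confusion probabilities.

```python
from itertools import product

# Hypothetical confusion table: recognized character -> list of
# (possibly intended character, probability). Values are invented.
CONFUSIONS = {
    "c": [("c", 0.7), ("e", 0.2), ("o", 0.1)],
    "a": [("a", 0.9), ("o", 0.1)],
    "t": [("t", 0.8), ("f", 0.2)],
}


def trigrams(word):
    """Return the set of character trigrams of a word padded with ^ and $."""
    padded = f"^{word}$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


# Hypothetical trigram dictionary built from a tiny lexicon.
VALID_TRIGRAMS = set().union(*(trigrams(w) for w in ["cat", "eat", "oat", "cot"]))


def suggest(word, confusions=CONFUSIONS, valid_trigrams=VALID_TRIGRAMS,
            per_char=5, top_k=5):
    """Generate candidates from the top confused characters and rank them."""
    # For each input character keep its `per_char` most probable alternatives;
    # characters missing from the table are kept unchanged with probability 1.
    options = [
        sorted(confusions.get(ch, [(ch, 1.0)]), key=lambda p: -p[1])[:per_char]
        for ch in word
    ]
    scored = []
    for combo in product(*options):
        candidate = "".join(ch for ch, _ in combo)
        # Discard candidates containing a trigram absent from the dictionary.
        if not trigrams(candidate) <= valid_trigrams:
            continue
        prob = 1.0
        for _, p in combo:
            prob *= p
        scored.append((candidate, prob))
    # Highest-probability candidates first; return at most `top_k` of them.
    scored.sort(key=lambda item: -item[1])
    return scored[:top_k]


if __name__ == "__main__":
    for candidate, prob in suggest("cat"):
        print(candidate, round(prob, 3))
```

Enumerating every combination grows exponentially with word length (up to 5^n strings for an n-character word), so a practical system would likely prune candidates early, for example by rejecting a prefix as soon as one of its trigrams fails validation.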

Keywords

Automatic Text Correction, Confusion Matrix, Devanagari, OCR, Trigram.





Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.