Preparation of a Dataset and Issues Related with Recognition of Optical Character in Assamese Script

Affiliations

  • Department of Computer Science and IT, Cotton College State University, India
  • Department of Computer Science, Gauhati University, India

Abstract


According to the website ‘ethnologue.com’, which surveys and compiles statistics on the world's languages, there are currently 7102 living languages. The number of living languages is steadily declining, which is a matter of concern. An article published by UNESCO in 2009 states that most of the endangered languages belong to India. In the digital era, a language can be kept alive if it is widely used on computers, for example through software applications with regional-language interfaces. In this context, researchers from the North-East region of India are working to develop Optical Character Recognition (OCR) systems that can digitize document images written in the major North-East Indian languages. Because the characteristics of scripts differ from one another, so do the challenges. Keeping the needs of researchers in mind, we have developed a novel offline dataset of Assamese historical, machine-printed and handwritten documents, which can be used to experiment with various techniques for Assamese character recognition. The dataset comprises a variety of modern and old Assamese texts collected from a variety of sources, broadly divided into machine-printed and handwritten documents. Both good-quality and degraded documents are included. Many researchers are working on OCR systems for Assamese script, but many challenges remain to be addressed. We discuss issues related to degraded text, historical documents, handwritten Assamese text and machine-printed text with reference to samples from the dataset. These include the segmentation of touching characters and the difficulty of distinguishing compound characters from touching characters; skewed documents, whose varying skew makes line segmentation difficult; and heavily printed documents, which complicate feature extraction. The dataset also contains pages on which text from the reverse side shows through, making the document noisy. Besides these inherent issues of character recognition, issues specific to the recognition of old Assamese script are discussed in detail. We expect the dataset to be useful, and the issues discussed here to attract further interest from researchers working in this field. More research and innovation in digitizing Assamese documents, books and historical documents will help sustain both the language and the script.
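
To illustrate why skew complicates line segmentation, the following is a minimal sketch and not the method used in the paper. It assumes a simple horizontal projection profile segmenter applied to a synthetic binary page; the function names (row_profile, segment_lines, make_page) and the page dimensions are hypothetical and chosen only for this example.

```python
def row_profile(img):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in img]


def segment_lines(img):
    """Return inclusive (start_row, end_row) pairs for runs of rows that contain ink."""
    lines, start = [], None
    for y, count in enumerate(row_profile(img)):
        if count > 0 and start is None:
            start = y                      # a new text line begins
        elif count == 0 and start is not None:
            lines.append((start, y - 1))   # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(img) - 1))
    return lines


def make_page(width, height, line_tops, thickness, slope=0.0):
    """Synthetic page (assumption): each text line is a solid band of ink.

    With a non-zero slope, the band drifts downward by `slope` rows per column,
    which imitates a skewed scan.
    """
    img = [[0] * width for _ in range(height)]
    for top in line_tops:
        for x in range(width):
            shift = int(slope * x)
            for dy in range(thickness):
                y = top + shift + dy
                if 0 <= y < height:
                    img[y][x] = 1
    return img


straight = make_page(40, 30, [2, 10], 3)
skewed = make_page(40, 30, [2, 10], 3, slope=0.25)

print(segment_lines(straight))  # two separate lines are found
print(segment_lines(skewed))    # the blank gap disappears and the lines merge
```

On the straight page the blank rows between the two bands separate them cleanly. On the skewed page the bands overlap in the row profile, so the segmenter reports a single merged line. This is one reason skew estimation and correction usually precede profile-based line segmentation.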

Keywords

Assamese Character Recognition, Dataset of Major North-East Indian Script, Document Analysis and Retrieval, Historical Document.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.