Offline Urdu OCR using Ligature based Segmentation for Nastaliq Script

The two most popular writing styles of Urdu are Naskh and Nastaliq. In Arabic OCR research, an ample amount of work has been done on the Naskh writing style; Urdu, which uses the Arabic character set, is commonly written in the Nastaliq writing style. Because of the Nastaliq writing style, Urdu OCR poses many distinct challenges, such as compactness, diagonal orientation and context-sensitive character shapes, for an OCR system trying to correctly recognize an Urdu text image. Due to the compactness and slanting nature of Nastaliq, existing methods for the Naskh style do not give desirable results. Therefore, in this paper, we present a ligature-based segmentation OCR system for the Urdu Nastaliq script. We discuss in detail various unique challenges for Urdu OCR and different feature extraction techniques for ligature recognition using SVM and kNN classifiers. The system is trained to recognize 11,000 Urdu ligatures. We achieved 90.29% overall accuracy when tested on Urdu text images.

Ankur Rana1* and Gurpreet Singh Lehal2
1Department of Computer Engineering, Punjabi University, Patiala 147002, Punjab, India; ankurrana628@gmail.com
2Department of Computer Science, Punjabi University, Patiala 147002, Punjab, India; gslehal@gmail.com


Introduction
Optical character recognition (OCR) is a technique used to convert image data into an editable text format. It is used to digitize printed material such as books, newspapers and old literature, and in applications such as book readers for the blind and banking. A lot of research has been done on the Arabic language. Urdu shares its character set with Arabic. Arabic is generally written from right to left in the Naskh writing style, whereas Urdu is written from right to left in the Nastaliq writing style. Nastaliq is a cursive writing style, and because of it very little work has been done on Urdu. For cursive scripts like Urdu, researchers have used both segmentation-based and segmentation-free approaches. Most researchers have worked either on isolated Urdu character recognition or on a small set of Urdu ligatures.
Shamsher et al. 1 and Ahmad et al. 2 developed OCR systems for printed Urdu script. They used a feed-forward neural network for training and classification of Urdu characters. Shamsher et al. 1 extracted the extreme points of all characters as the feature vector for the neural network. They reported 98.3% accuracy in recognizing individual Urdu characters. Ahmad et al. 2 experimented with the structural features of machine-printed ligatures. They reported 70% accuracy. Pathan 3 worked on handwritten isolated Urdu characters.
Al Muhtaseb et al. 4 used an HMM for developing an Urdu OCR. They used a vertical sliding window three pixels wide and extracted 16 features by dividing the window into 8 equal parts and summing the number of black pixels in each sub-window. They used 2500 lines for training and 266 lines for testing. They achieved 99% accuracy for the Arial font. Their system is dependent on font style and font size.
Hasan et al. 5 used an LSTM (Long Short-Term Memory) neural network to recognize printed Urdu Nastaliq text. They also used a sliding window, of width 30, for feature extraction. They took the pixel values as the feature vector. They reported 94.85% character recognition accuracy.
Lehal and Rana 6 explored ligature recognition for Urdu OCR. They divided each ligature into two components, primary and secondary.
We have considered 10082 ligatures for the development of the Urdu OCR. Our approach is segmentation based, and we consider pre-segmented text for recognition. We divide each ligature into two sub-components: i) the primary component and ii) the secondary component. This reduces the total number of classes to 1845 for the primary component and 19 for the secondary components. We achieved 98.15% accuracy for the primary component and 99.91% accuracy for the secondary components. The overall accuracy of the Urdu OCR on input Urdu text images (having 500 ligatures on average) is 90.29%.

Overview of Urdu
Urdu is the national language of Pakistan and an official language of six Indian states: Delhi, Jammu and Kashmir, Uttar Pradesh, Bihar, Andhra Pradesh and Telangana. The Urdu script is derived from the Farsi script. Urdu is particularly important as it has been widely used by poets for composing their poetry. Urdu is written in the Nastaliq style, whereas Arabic is written in the Naskh style. Figure 1 shows the same Urdu text in the two writing styles.
Urdu has 38 basic letters. Figure 2 shows all the basic letters of the Urdu script. It is written from right to left, whereas Urdu numerals are written like Roman numerals, i.e. from left to right.
Urdu characters are classified as joiners and non-joiners. Characters that join with the preceding character but not with the succeeding character are termed non-joiners. There are 12 non-joiners in Urdu, as shown in Figure 3. All characters other than these 12 non-joiners connect with both the preceding and the succeeding character and change their shape.

Challenges in Urdu Recognition
The development of an OCR for the Urdu script involves many unique complexities, which are as follows:
• Urdu is written diagonally from right to left and top to bottom. All ligatures look tilted at some angle from top right to bottom left as the different joiners are written. Numerals add another level of complexity to Urdu OCR. In books containing both Roman and Arabic numerals, we observed that both Urdu and Roman numerals are written from left to right, as shown in Figure 4.
• As Urdu is mostly written diagonally from top right to bottom left, ligatures overlap, which poses a problem for character and word segmentation. Figure 5 shows overlapped ligatures marked in red, green and blue.

• Urdu is a context-sensitive script as well. Context sensitive means that the shape of a character depends on the succeeding character. When different characters are written together, the shape of a character depends on the shape of the character it follows, as shown in Table 1. The character in red on the left-hand side of the equals sign is the character that changes its shape on the right-hand side when written with other characters. In Naskh, each character has four shapes, i.e. isolated, middle, left and right, as shown in the table. Urdu in the Nastaliq script, however, has more than four shapes per character. Naz 9 reported 32 different shapes of a single character in the Nastaliq script.
• One of the major problems of Urdu OCR is broken characters in the image. If one of the diacritics is not printed or not correctly scanned, the OCR cannot correctly identify it. In Figure 6 the ligature is broken at the end. Consequently, the OCR cannot tell whether to treat the broken ligature as one component or as more than one.
• Another problem for Urdu OCR is merged characters. In Urdu, different ligatures have the same basic shape but different diacritics, and the diacritics identify the correct meaning of the ligature. Sometimes, however, the diacritics get merged with the primary shape of the ligature, and it is difficult to segment the primary shape from the diacritics, as shown in Figure 7.
• There is very little space between words in Urdu. Ligatures end either with a space or with a non-joiner as the last character. In Figure 8, the green bars show the boundaries between different words, whereas the red bars show the space between two ligatures within one word. From the spacing between ligatures and words alone, we cannot find the word boundaries.
• Line segmentation also poses a problem in Urdu OCR. Because of the diagonal writing style of the script, two ligatures on two different lines can get merged, as shown in Figure 9.
• In some Urdu books, the footer of the page is written in the Naskh style. Having both Naskh and Nastaliq writing styles on one page adds another level of challenge to the correct recognition of the page. In Figure 10, the Urdu text in green is in the Nastaliq writing style and the Urdu text in red is in the Naskh writing style. Having both styles on one page is a type of multiscript recognition, which further increases the complexity of the system.

Table 1. Different shapes of tey, tay and meem with different other characters.


Figure 10. Urdu text written in two different styles (multiscript on one page).

Urdu Data Preparation
Urdu text printed in books and newspapers can be divided into two generations. Books printed before 1995 are all handwritten, while the majority of books published after 1995 use computer-generated Nastaliq fonts such as Alvi Nastaliq or Noori Nastaliq. The shape of a character in Urdu depends on the shape of the character it follows. That is why we cannot take the character as the classification unit for recognition. Ligature segmentation at character level is shown in Figure 11.
For the development of the Urdu OCR we have taken the ligature as the classification unit. A ligature is a connected component containing different Urdu characters, and its end character is either a non-joiner or followed by a space. An Urdu word can have more than one ligature.
We have developed our proposed system with these 10082 ligatures. Collecting training data for 10082 ligatures from books is, however, a very cumbersome task. To decrease the number of recognition classes, Lehal 10 separated the ligature into primary and secondary components, as shown in Figure 12. After removing diacritics from the 10082 ligatures and grouping ligatures having the same primary component, we have 1845 classes of primary components and 16 secondary components. When these secondary components are used along with the 1845 primary components, we get a total of 10082 ligatures.
We gathered training data from various scanned books. Some of the least frequent ligatures, as mentioned by Lehal 10, rarely occur in books. To generate training data for the primary components of those ligatures, we made synthetic images with different font sizes, i.e. 35, 38, 40, 45, 50 and 55, and different format options such as bold or regular. We removed the diacritic marks from these ligatures to get the primary component. We gathered a total of 1200 primary components from the scanned books, and the rest of the primary components were generated synthetically. Even among the 1200 primary components, some have fewer samples than the required number. To complete the samples, we mixed synthetic samples with samples from scanned books. We also trained Urdu numerals and Roman numerals as primary components. Samples of the primary components are shown in Figure 13.
We also collected samples of merged ligatures. Merged ligatures are ligatures in which diacritics merge with the shape of the primary component. Merged or touching characters pose a problem for recognition in any OCR system. After analyzing scanned images from books, we found some diacritic components merged with the primary component shape. We collected a total of 41 such merged primary components and added them to the existing primary components. Some of the diacritics touching primary components are shown in Figure 14.
We collected 150 samples for each secondary component from the scanned books. Samples of the secondary components are shown in Figure 15. Complete statistics of the training data are given in Table 2.
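The class reduction described in this section (10082 ligatures grouped into a much smaller set of primary components plus a few secondary codes) can be illustrated with a small sketch. The code book entries, names and codes below are hypothetical toy values chosen for illustration, not the authors' data, and the secondary code is treated as an opaque string.

```python
from collections import defaultdict

# Hypothetical toy code book: ligature id -> (primary component code, secondary string code).
# An empty secondary code means the ligature carries no diacritics.
TOY_LIGATURES = {
    "lig_a": (46, "hB"),
    "lig_b": (46, "hA"),
    "lig_c": (46, ""),
    "lig_d": (112, "hB"),
    "lig_e": (112, ""),
}


def group_by_primary(ligatures):
    """Group ligatures that share a primary component and collect secondary codes."""
    by_primary = defaultdict(list)
    secondary_codes = set()
    for name, (primary, secondary) in ligatures.items():
        by_primary[primary].append(name)
        if secondary:
            secondary_codes.add(secondary)
    return dict(by_primary), secondary_codes


by_primary, secondary_codes = group_by_primary(TOY_LIGATURES)
print(len(TOY_LIGATURES), "ligatures ->", len(by_primary), "primary classes")
```

In this toy example five ligatures need only two primary classes, which is the same effect, at a much smaller scale, as the reduction reported above.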

Features Extraction
For the classification of any pattern, relevant features have to be extracted. Researchers have used many different features for Urdu OCR. We recognize primary and secondary components separately. For classification of the primary component of a ligature, we calculated DCT, Gabor, directional and gradient features. For the DCT, we scaled all images to 32 × 32. The DCT of the whole image gives 1024 frequency components. An important property of the DCT is that the top-left corner of the coefficient matrix holds the low-frequency components, where most of the information about the image is concentrated. We therefore extracted 100 features from the 1024 coefficient values in zigzag order, as shown in Figure 16.
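A minimal sketch of this DCT feature extraction follows, assuming an orthonormal DCT-II and a JPEG-style zigzag starting at the top-left coefficient. The exact DCT normalization and zigzag direction used in the paper are not stated, so both are assumptions, and the function names are illustrative only.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis: row k is the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def dct2(image):
    """2-D DCT of a square image, computed separably as C * image * C^T."""
    c = dct_matrix(len(image))
    c_t = [list(col) for col in zip(*c)]
    return matmul(matmul(c, image), c_t)


def zigzag_indices(n):
    """(row, col) pairs in zigzag order, starting from the top-left corner."""
    order = []
    for s in range(2 * n - 1):
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Alternate the traversal direction on successive anti-diagonals.
        order.extend(diagonal[::-1] if s % 2 == 0 else diagonal)
    return order


def dct_features(image, count=100):
    """First `count` DCT coefficients of a square image in zigzag order."""
    coeffs = dct2(image)
    return [coeffs[r][c] for r, c in zigzag_indices(len(image))[:count]]


# Usage: a constant 32 x 32 image has all of its energy in the first coefficient.
features = dct_features([[1.0] * 32 for _ in range(32)])
```

Because the zigzag walk starts at the top-left, the 100 retained values are the lowest-frequency coefficients of the 1024 available.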

Figure 16. DCT coefficients of the image selected in zigzag direction into one vector.

Table 2. Training data statistics
Total Ligatures 10082

Gabor Features
The Gabor function G(i,j) is a linear filter (Rajneesh Rani et al. 11). In the Gabor function, θ is the orientation of the sinusoidal plane wave, λ is the wavelength, and σ_m and σ_n are the standard deviations. We took both standard deviations to be equal for feature extraction.
To calculate the features of the input, the image is first scaled to 32 × 32 pixels. It is then partitioned into four equal non-overlapping sub-regions of size 16 × 16. Each of these sub-regions is further partitioned into 4 non-overlapping sub-sub-regions of size 8 × 8, giving a total of 16 small regions. The resulting 21 images (the whole image, the 4 sub-regions and the 16 sub-sub-regions) are then convolved with odd-symmetric and even-symmetric Gabor filters at nine different orientations θ, in steps of 20 degrees, to obtain a feature vector of 189 values (21 × 9).

Directional Features
Directional features 12

Gradient Features
A gradient feature 12,13 calculates the magnitude and direction of the maximum change in intensity in the neighborhood of a pixel. For gradient feature extraction, the image is first normalized to 63 × 63 pixels. After normalizing the image, the gradient vector is calculated in both the x and y directions at each pixel position using the Sobel operator, as shown in Figure 17. After calculating the gradient vector, we calculated the magnitude and direction as follows:

$$\text{Magnitude}(x, y) = \sqrt{g_x(x, y)^2 + g_y(x, y)^2}$$

$$\text{Direction}(x, y) = \tan^{-1}\left(\frac{g_y(x, y)}{g_x(x, y)}\right)$$

The gradient direction vector is then decomposed along 8 chain code directions (D0, D1, D2, D3, D4, D5, D6 and D7), as shown in Figure 18. After that, the character image is divided into 9 × 9 blocks (81 blocks). If the gradient vector lies between two directions, it is decomposed along them; otherwise, the magnitude of the vector is retained. This results in 63 × 63 × 8 values. Next, the spatial resolution of 9 × 9 is reduced to 5 × 5 by down-sampling every two horizontal and vertical blocks with a 5 × 5 Gaussian filter, giving 200 (5 × 5 × 8) feature values per image.

Primary Component Recognition
Lehal and Rana 6 reported 98.01% and 96.78% accuracy on 2190 primary component classes using a Support Vector Machine 14 (SVM) and k nearest neighbor (kNN), respectively. We found that, out of the 2190 primary components, many primary shapes of the ligatures look the same. Therefore, our primary component count is reduced to 1873. We used DCT, Gabor, directional and gradient features for feature extraction of the primary component, and SVM (linear and polynomial kernel) and kNN classifiers for primary component recognition. The results of primary component recognition with the SVM classifier using a linear kernel and polynomial kernels of degree 3 and 4 are given in Table 3.
We also experimented with the k nearest neighbor classifier with different values of k (1, 3 and 5). The results of primary component recognition using the kNN classifier are shown in Table 4. With an increase in the value of k, we observed that the accuracy goes to 98.30%.
We observed that, due to the large number of classes, SVM takes on average 559 seconds to recognize 3746 primary components of ligatures, whereas the kNN classifier takes only 170 seconds to recognize the same.
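As an illustration of the kNN classification step for primary components, the following is a minimal sketch that classifies a feature vector by majority vote among its k nearest training vectors. Euclidean distance, the tie-breaking rule and the toy two-dimensional data are assumptions made here for illustration; the paper does not specify the distance measure.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k training vectors closest to query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance (assumed)
    )
    nearest_labels = [label for _, label in ranked[:k]]
    votes = Counter(nearest_labels)
    top = max(votes.values())
    # Break ties in favour of the label whose neighbour is closest to the query.
    return next(label for label in nearest_labels if votes[label] == top)


# Usage with toy 2-D vectors standing in for real feature vectors.
train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["class_a", "class_a", "class_a", "class_b", "class_b", "class_b"]
print(knn_predict(train, labels, [0.2, 0.1], k=3))
```

A brute-force search like this compares the query with every training sample, so its cost grows with the size of the training set rather than with the number of classes.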

Secondary Component Recognition
We have 19 secondary component classes, for which we extracted DCT, Gabor and zoning features.
For training, we have 100 samples for each class. Table 5 shows the results of secondary component recognition with different features and classifiers. The combination of DCT features and a polynomial SVM classifier (degree 3) attains 99.50% recognition accuracy.
We also used the kNN classifier for secondary component recognition with different values of k (1 and 3). The experimental results of secondary component recognition with kNN are given in Table 6.

Formation of Ligature
After obtaining the primary and secondary components, we form the ligature by grouping their two codes (the primary component code and the secondary component code). We manually crafted a code book that comprises primary component codes and secondary string codes. The ligature is looked up from the combination of the primary component code and the secondary string code. To search for a combination of primary and secondary codes, we implemented a Binary Search Tree (BST). Each node of the binary search tree contains a primary code and a linked list of nodes holding secondary codes and their ligature codes in Unicode. The binary search tree and the linked-list nodes of our code book structure are shown in Figure 19. For example, if the primary component classifier gives a code of 46 and the secondary string code is hB, the BST search gives ‫ﮔﯩﻮ‬ as the identified ligature.
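The code book lookup can be sketched as a binary search tree keyed on the primary code, where each node keeps a list of (secondary code, ligature) pairs. This is a simplified illustration, not the authors' implementation: a Python list stands in for the linked list, and all entries other than the 46/hB example from the text are hypothetical placeholders.

```python
class CodeBookNode:
    """BST node: one primary code plus its (secondary code, ligature) entries."""

    def __init__(self, primary_code):
        self.primary_code = primary_code
        self.entries = []  # a Python list stands in for the linked list
        self.left = None
        self.right = None


def insert(root, primary_code, secondary_code, ligature):
    """Insert one code book entry and return the (possibly new) subtree root."""
    if root is None:
        root = CodeBookNode(primary_code)
    if primary_code == root.primary_code:
        root.entries.append((secondary_code, ligature))
    elif primary_code < root.primary_code:
        root.left = insert(root.left, primary_code, secondary_code, ligature)
    else:
        root.right = insert(root.right, primary_code, secondary_code, ligature)
    return root


def lookup(root, primary_code, secondary_code):
    """Return the ligature for the given code pair, or None if it is not found."""
    node = root
    while node is not None:
        if primary_code == node.primary_code:
            for code, ligature in node.entries:
                if code == secondary_code:
                    return ligature
            return None
        node = node.left if primary_code < node.primary_code else node.right
    return None


# The 46/hB entry follows the example in the text; the others are placeholders.
EXAMPLE_LIGATURE = "ﮔﯩﻮ"
root = None
for primary, secondary, lig in [
    (46, "hB", EXAMPLE_LIGATURE),
    (46, "", "placeholder-46"),
    (12, "hA", "placeholder-12-hA"),
    (90, "", "placeholder-90"),
]:
    root = insert(root, primary, secondary, lig)

result = lookup(root, 46, "hB")
```

The tree search narrows the lookup to one primary code, after which only the short list of secondary codes for that primary component has to be scanned.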

Conclusions and Future Scope
We used DCT features with a linear-kernel SVM for primary component recognition. For secondary component recognition we used DCT features (feature vector length 100) and a polynomial-kernel SVM (degree 3). We tested our system on 110 pages of Urdu text and obtained 83% accuracy. Urdu images with no broken or merged primary or secondary components and no Naskh-style text reach an accuracy of nearly 90.29%. A new methodology needs to be devised to handle broken or merged ligatures. In addition, to recognize the Naskh writing style on the same Urdu text page, a Naskh recognition OCR and font identification need to be developed.


Figure 1 .
Figure 1.Red and Blue color text is in Nastaliq and Naskh writing style respectively.

Figure 4 .
Figure 4. Urdu writing style in nastalique from right to left and number written from left to righ  As Urdu is mostly written diagonally from right top to bottom left, there is problem of segm poses problem for the character and word segmentation.Figure 5 exhibit shows overlap marked as in red, green and blue color.


Urdu has 38 basic letters; Figure 2 shows all of them. Urdu text is written from right to left, whereas Urdu numerals are written from left to right, like Roman numerals.

Figure 2. Urdu alphabet and numerals.

Urdu characters are classified as joiners and non-joiners. Characters that join with the preceding character but not with the succeeding character are termed non-joiners, as shown in Figure 3. The remaining characters, other than these 12 non-joiners, join with the neighbouring character and change shape.

The development of OCR for Urdu script involves many unique complexities. Urdu is written diagonally, from top right to bottom left, as the joiners are connected, which adds to the complexity of Urdu OCR. Numerals, both Urdu and Roman, are written from left to right, as shown in Figure 4.


Figure 4. Urdu text in the Nastaliq style written from right to left, with numbers written from left to right.


Figure 5. Ligature overlapping.

Urdu is a context-sensitive script as well. Context sensitive means that the shape of a character depends on the characters it is joined with.

Figure 7. Merged diacritics with ligature in Urdu text.


Figure 10. Urdu text written in two different styles (multiple scripts on one page).


Figure 8. Urdu text with very small space between ligatures.

There is very little space between words in Urdu. As shown in Figure 8, the green bars mark the space between words, whereas the red bars mark the space between two ligatures. Because the spacing between ligatures is so close to the spacing between words, word boundaries cannot be found reliably.

Figure 9. Two ligatures on different lines merged together.

In some Urdu books the footer of the page is written in the Naskh style, which adds complexity to recognition. Both Naskh and Nastaliq writing styles can thus appear on the same page (Figure 10).


Figure 11. Urdu word with character segmentation.

For the development of the Urdu OCR we have taken the ligature as the classification unit. A ligature is a connected component made up of one or more Urdu characters, and its last character is either a non-joiner or is followed by a space. An Urdu word can contain more than one ligature. For example, the word ﺑﺎﺩﺷﺎﻩ (badshah) is composed of four ligatures: two multi-character ligatures, ﺑﺎ and ﺷﺎ, and two single-character ligatures, ﺩ and ﻩ. Each of the two multi-character ligatures is composed of two characters (ﺏ + ﺍ = ﺑﺎ and ﺵ + ﺍ = ﺷﺎ).

Lehal10 carried out a statistical analysis of the recognizable unit for Urdu OCR. He took a corpus of 6,533,057 words and identified 25,957 unique ligatures. He found that about 10,082 of these ligatures account for 99% of the ligature occurrences in the whole corpus.

Figure 12. (a) Urdu ligature (b) Ligature primary component (c) Ligature secondary component.

After removing diacritics from the 10,082 ligatures and grouping ligatures that share the same primary component, we have 1845 primary component classes and 16 secondary components. When these secondary components are combined with the 1845 primary components, we get the total of 10,082 ligatures.

We gathered training data from various scanned books. Some of the least frequent ligatures, as mentioned by Lehal10, rarely occur in these books. To generate training data for the primary components of those ligatures, we made synthetic images with different font sizes (35, 38, 40, 45, 50 and 55) and different format options (bold or regular). We removed the diacritic marks from these ligatures to get the primary component. We gathered a total of 1200 primary components from the scanned books; the remaining primary components were covered by the synthetic images.

Urdu books can be divided into two generations; books published after 1995 use computer-generated Nastaliq. Because the shape of a character in Urdu depends on the shape of its neighbouring characters, the character is not a suitable classification unit for recognition, as shown in Figure 11.

We have developed our proposed system with these 10,082 ligatures. However, collecting training data for 10,082 ligatures from books is a very cumbersome task. To decrease the number of recognition classes, Lehal10 has divided each ligature into primary and secondary components, as shown in Figure 12.

Figure 13. Primary component training sample data.

5.1 Discrete Cosine Transformation (DCT)
DCT is a statistical feature extraction technique. It maps the ligature image from the spatial domain to the frequency domain. The low-frequency components are concentrated in the upper-left corner of the coefficient matrix, and the high-frequency components lie towards the bottom-right corner. The DCT coefficients f(p, q) of an image I(m, n) are computed by equation (1).

Gabor features have been used for script identification. The Gabor filter is also used for edge detection in image processing. It is the product of a harmonic function and a Gaussian function:

$G(m, n) = P(m, n) \, C(m, n)$

It is sensitive to both orientation and spatial frequency.
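The DCT feature extraction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: it assumes a square grayscale image given as a list of rows, uses the orthonormal DCT-II (the normalization of equation (1) is not reproduced here), and reads the first 100 coefficients in zigzag order starting from the low-frequency corner. The function names `dct2`, `zigzag_indices` and `dct_features` are hypothetical.

```python
import math


def dct2(image):
    """Orthonormal 2-D DCT-II of a square image given as a list of rows."""
    n = len(image)
    # basis[k][i] = alpha(k) * cos(pi * (2i + 1) * k / (2n))
    basis = [[(math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
              * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
              for i in range(n)] for k in range(n)]
    # Transform along rows of pixels first, then along columns.
    tmp = [[sum(basis[p][r] * image[r][c] for r in range(n)) for c in range(n)]
           for p in range(n)]
    return [[sum(tmp[p][c] * basis[q][c] for c in range(n)) for q in range(n)]
            for p in range(n)]


def zigzag_indices(n):
    """(row, col) pairs of an n x n matrix in zigzag order from the top-left."""
    order = []
    for s in range(2 * n - 1):
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Odd anti-diagonals run downwards, even ones run upwards.
        order.extend(diag if s % 2 else reversed(diag))
    return order


def dct_features(image, length=100):
    """First `length` DCT coefficients of the image in zigzag order."""
    coeffs = dct2(image)
    return [coeffs[r][c] for r, c in zigzag_indices(len(image))[:length]]
```

For a uniform image only the first (DC) coefficient is non-zero, which is why the zigzag scan keeps the coefficients nearest the upper-left corner and discards the rest.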

Figure 14. Sample of a primary component merged with a secondary component.


Figure 15. Secondary component training sample data.


We collected 150 samples for each secondary component from the scanned books. Samples of the secondary components are shown in Figure 15.


Figure 16. DCT coefficients of the image selected in zigzag order into one vector.

Figure 17. Sobel operator.

After calculating the gradient vector, we calculated its magnitude and direction as given in the equation.
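The gradient computation can be illustrated with the standard 3 × 3 Sobel kernels. The sketch below is an assumption about the usual formulation rather than a copy of the paper's equation: the magnitude is taken as sqrt(gx² + gy²) and the direction as atan2(gy, gx), border pixels are left at zero, and the name `sobel_gradient` is hypothetical.

```python
import math

# Standard Sobel kernels for horizontal (x) and vertical (y) derivatives.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_gradient(image):
    """Return (magnitude, direction) maps for a grayscale image (list of rows)."""
    h, w = len(image), len(image[0])
    magnitude = [[0.0] * w for _ in range(h)]
    direction = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = sum(SOBEL_X[i][j] * image[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * image[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            magnitude[r][c] = math.hypot(gx, gy)
            direction[r][c] = math.atan2(gy, gx)  # radians, in (-pi, pi]
    return magnitude, direction
```

On an image with a single vertical edge, the response is purely horizontal: the magnitude is non-zero along the edge and the direction there is 0 radians.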

Table 1. Different shapes of tey, tay and meem in combination with other characters

Table 2. Training data statistics

For the directional features, we calculated the distance of black and white pixels in eight different directions for each pixel. We scaled the input image to 36 × 36 pixels. After scaling, the directional feature vector is calculated in eight directions, i.e. 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°. This gives a directional feature vector of length 16 for each pixel. To down-sample the 20,736 (16 × 36 × 36) directional feature values, we divided the image into 9 (3 × 3) zones. We then generated a feature vector of length 16 for each zone by adding the corresponding directional feature values of all the pixels in that zone. This yields a directional feature vector of length 144, i.e. 16 feature values × 9 zones.
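The zoning arithmetic above (16 values per pixel, summed over 3 × 3 zones to give 144 features) can be sketched as follows. The text does not define the per-pixel distance precisely, so this sketch makes an assumption: for each direction it counts the steps to the nearest black pixel and to the nearest white pixel, using 0 when none is found before the image border. The names `run_distance` and `directional_features` are hypothetical, and the input is assumed to be a binary image with 1 for black (ink) and 0 for white.

```python
# (d_row, d_col) steps for 0, 45, 90, ..., 315 degrees; rows grow downwards.
DIRECTIONS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
              (0, -1), (1, -1), (1, 0), (1, 1)]


def run_distance(image, r, c, dr, dc, target):
    """Steps from (r, c) along (dr, dc) to the first pixel equal to target; 0 if none."""
    size = len(image)
    steps = 1
    r, c = r + dr, c + dc
    while 0 <= r < size and 0 <= c < size:
        if image[r][c] == target:
            return steps
        r, c, steps = r + dr, c + dc, steps + 1
    return 0


def directional_features(image, zones=3):
    """Sum the 16 per-pixel directional values inside each zone of a square image."""
    size = len(image)
    zone_size = size // zones
    features = [0] * (zones * zones * 16)
    for r in range(size):
        for c in range(size):
            zone = (r // zone_size) * zones + (c // zone_size)
            for d, (dr, dc) in enumerate(DIRECTIONS):
                # Index 2*d holds the black distance, 2*d + 1 the white distance.
                for t, target in enumerate((1, 0)):
                    features[zone * 16 + 2 * d + t] += run_distance(
                        image, r, c, dr, dc, target)
    return features
```

For a 36 × 36 input each zone covers 12 × 12 pixels, so the returned list has 9 × 16 = 144 entries, matching the feature vector length stated in the text.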

Table 3. Primary component of ligature recognition using SVM classifier

Table 4. Primary component of ligature recognition using kNN classifier

Table 5. Secondary component of ligature recognition using SVM classifier

Table 6. Secondary component of ligature recognition using kNN classifier

Offline Urdu OCR using Ligature based Segmentation for Nastaliq Script. Indian Journal of Science and Technology, Vol 8 (35), December 2015, www.indjst.org