Design and Implementation of Shahmukhi Spell Checker

A spell checker is a software tool that identifies and corrects spelling mistakes in a text document. Designing a spell checker for the Punjabi language is a challenging task. Punjabi can be written in two scripts: Gurmukhi (a left-to-right script based on Devanagari) and the Perso-Arabic script (a right-to-left script), also referred to as Shahmukhi. Gurmukhi follows the 'one sound, one symbol' principle, whereas Shahmukhi follows 'one sound, multiple symbols', which makes Shahmukhi text even more challenging and complicates the design of a spell checker for it. Text written in Shahmukhi normally omits short vowels and diacritic marks, so a missing diacritic should not ordinarily be treated as a mistake; for holy books such as the Quran, however, missing diacritic marks are considered mistakes. The spell checker is therefore designed to check spelling both with and without compulsory diacritics, depending on the user's selection. In addition, Shahmukhi text has complex grammatical rules and phonetic properties, and thus needs different algorithms and techniques to achieve the expected efficiency. This paper presents the complete design and implementation of a spell checker for Shahmukhi text.

Kawarbir Singh Dhanju1*, Gurpreet Singh Lehal1, Tejinder Singh Saini2 and Arshdeep Kaur1
1DCS, Punjabi University, Patiala 147 002, Punjab, India; kbs.dhanju@gmail.com, gslehal@gmail.com, akaur448@yahoo.com
2ACTDPL, Punjabi University, Patiala 147 002, Punjab, India; tej@pbi.ac.in


Introduction
A spell checker is a software tool that identifies and corrects spelling mistakes in a text. It checks the spelling of each word in a document and validates it, i.e. determines whether it is correctly or wrongly spelled, and when it has doubts about the spelling of a word, it suggests possible alternatives.
The main steps performed by the spell checker are:
• Input Shahmukhi words from the user document.
• Pre-process the words.
• Detect an erroneous word by searching for it in the dictionary.
• In case the word is erroneous, suggest possible alternatives.
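These four steps can be sketched as a minimal pipeline. The tiny Romanized lexicon and the helper names below are purely illustrative placeholders, not the system's actual implementation:

```python
# Illustrative sketch of the four spell-checking steps; the lexicon,
# preprocessing, and suggestion logic are toy placeholders.
LEXICON = {"salam", "kitab", "qalam"}  # Romanized stand-in for a Shahmukhi lexicon

def preprocess(word):
    # Placeholder: real preprocessing normalizes Unicode and strips optional diacritics.
    return word.strip().lower()

def is_error(word):
    # Step 3: detect an erroneous word by dictionary lookup.
    return preprocess(word) not in LEXICON

def suggest(word):
    # Placeholder: real suggestion generation uses phonetic codes and edit distance.
    w = preprocess(word)
    return [c for c in LEXICON if abs(len(c) - len(w)) <= 1 and c[0] == w[0]]

def spell_check(text):
    # Steps 1-4 combined: tokenize, preprocess, detect, suggest.
    return {tok: suggest(tok) for tok in text.split() if is_error(tok)}
```

For instance, `spell_check("kitab qalem")` flags only the misspelled token and offers the lexicon word closest to it.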
Even though this looks simple, writing Punjabi in the Shahmukhi script is more complex than in other languages such as English or Hindi. Thus, existing algorithms and techniques are not suitable for the design of a spell checker for the Shahmukhi script.

Brief Description of Punjabi Language
Punjabi is the 10th most widely spoken language in the world, spoken by 102 million speakers worldwide12. It is the native language of the Punjabi people, who inhabit the historical Punjab region of Pakistan and India. Punjabi can be written in two scripts, Gurmukhi and Shahmukhi. Gurmukhi is used to write Punjabi in India, and Shahmukhi is used to write Punjabi in Pakistan. Shahmukhi is essentially Punjabi written in the Perso-Arabic script (a right-to-left script). Shahmukhi text has complex grammatical rules and phonetic properties.
In Pakistan, Punjabi written in the Shahmukhi script is not an official language, so very little support and few resources are available for the script. In fact, this is the first time a spell checker for Shahmukhi text has been designed and implemented.

Shahmukhi Script
The meaning of "Shahmukhi" is "from the King's mouth"[1][2][3]10. Shahmukhi text was first used by the Sufi poets of the Punjab and later by the Muslim populace.
• Shahmukhi is written in the Nastaleeq style, from right to left; it is a highly complex writing system that is cursive and context-sensitive. It has 49 common and 6 rare consonants, 16 diacritical marks or vowels, etc.
• Consonants can be further subdivided into two groups: aspirated and non-aspirated consonants.
In Shahmukhi, aspirated consonants are represented by the combination of a consonant (to be aspirated) and HEH-DO CHASHMEE. The remaining six aspirated consonants are: In the case of non-aspirated consonants, Shahmukhi has more consonants than Gurmukhi, which follows the one-symbol-for-one-sound principle; in Shahmukhi, by contrast, there can be more than one character for a single sound. Diacritics are used to specify the vowels. In Shahmukhi, there are five long vowels.

The Unicode code point and vowel name of each long vowel are tabulated, and there are three short vowels in addition. According to the analysis, the diacritics listed below are considered optional:

Shahmukhi Numerals
Shahmukhi characters can be divided into two groups, non-joiners and joiners1. The non-joiners can acquire only the isolated and final shapes and do not join with the next character. On the contrary, joiners can acquire all four shapes and merge with the following character. A group of joiners and/or non-joiners joined together forms a ligature, and a word is a collection of one or more ligatures. The isolated forms of joiners and non-joiners are shown in Tables 8 and 9.
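The ligature segmentation described above can be sketched as follows. The non-joiner set shown is a small illustrative subset of the Arabic-script non-joining letters, not the full inventory of Tables 8 and 9:

```python
# Split a word into ligatures: a ligature ends right after a non-joiner,
# since a non-joiner never connects to the following character.
NON_JOINERS = set("اآدڈذرڑزژو")  # illustrative subset of non-joining letters

def ligatures(word):
    parts, cur = [], ""
    for ch in word:
        cur += ch
        if ch in NON_JOINERS:
            # Non-joiner closes the current ligature.
            parts.append(cur)
            cur = ""
    if cur:
        parts.append(cur)
    return parts
```

For example, پاکستان splits into three ligatures because its two alefs break the cursive connection.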

Error Pattern in Shahmukhi Text
Shahmukhi text has complex grammatical rules and phonetic properties, which leave it open to different types of mistakes. The following error patterns were observed in Shahmukhi text:

Multiple Characters with Same Sound (Phonetic Nature)
In the Shahmukhi script there is more than one letter for a single sound; some sounds have 5 to 6 letters, which is the major reason for spelling mistakes. Some examples are shown below.
One of the most common types of error in Shahmukhi text is that "gol he" (ہ) is used at the end of a word to produce the sound of "a", but users often misspell it with ا (alef). For example, امریکہ /ɘmrikɘ/ may be written as امریکا.

Characters with Similar Shapes
In the Shahmukhi script, characters such as those given below have the same shapes and are thus a reason for misspelled words.

Characters with Zero Width
In the Shahmukhi script, the characters given below in Table 12 have zero width, so if by mistake a user makes multiple entries of such a character, only a single entry is visible. If the spell checker flags such a word as misspelled, the user will not know where the error exists. This problem is considered both a Visual Error and a Dual Diacritic Error. For example, consider a word that has two "pesh" diacritic marks: visually the word looks correct, but internally it is stored wrongly, and the user will not be aware of where the error lies.
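A dual-diacritic check of this kind can be sketched using Unicode combining-class data; the function name is an illustrative choice:

```python
import unicodedata

def find_dual_diacritics(word):
    # Flag positions where the same combining mark (a zero-width diacritic)
    # appears twice in a row -- visually invisible but stored wrongly.
    errors = []
    for i in range(len(word) - 1):
        if unicodedata.combining(word[i]) and word[i] == word[i + 1]:
            errors.append(i)
    return errors
```

For a word carrying two consecutive "pesh" marks (U+064F), the function returns the index of the invisible duplicate, so the user can be told exactly where the stored error lies.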

Visual Errors due to Nastaleeq Style
As Shahmukhi is written in the Nastaleeq style, sometimes when a number of letters (joiners and non-joiners) combine to form a word, a diacritic mark is not visible to the user even though it may carry a mistake. Such problems are considered Visual Errors. In the example word, which has three joiners (ک، ھ، ے), the tasdid (ّ) is not visible when the letters are joined.
In this word the tasdid (ّ) is missing, which is considered a mistake when diacritic marks are compulsory.

Optional Diacritics
A problem specific to the Shahmukhi script is that short vowels and diacritic marks are not compulsory in writing. So if a word has two optional diacritics, the user may omit the first diacritic, the second, both, or neither. Thus, a single word with two diacritics has four variations, and all the cases have to be considered for spell checking.
For example, the word اُلفَت /ulfat/ can also be written as الفَت, اُلفت, or الفت; all these variations are correct and have to be considered.
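The variant enumeration can be sketched as follows. The set of optional diacritics shown (zabar, pesh, zer) is an illustrative stand-in for the full list in Table 5:

```python
from itertools import combinations

# Illustrative subset of optional diacritics: zabar (U+064E), pesh (U+064F), zer (U+0650).
OPTIONAL_DIACRITICS = {"\u064e", "\u064f", "\u0650"}

def variants(word):
    # A word with n optional diacritics has 2**n valid surface forms;
    # generate all of them by dropping every subset of diacritic positions.
    marks = [i for i, ch in enumerate(word) if ch in OPTIONAL_DIACRITICS]
    forms = set()
    for r in range(len(marks) + 1):
        for drop in combinations(marks, r):
            forms.add("".join(ch for i, ch in enumerate(word) if i not in drop))
    return forms
```

Applied to اُلفَت (two optional diacritics), this yields exactly the four valid variations discussed above.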

Presence of Izafat
Izafat are constructions in which two valid words are connected, like قابلِ اعتبار (kabil-i-aitbar), meaning 'worthy of trust'. In principle, any two words can be connected to form an Izafat. Some examples of Izafat are:

Lexicon Creation
The first step in the development of the spell checker is the creation of a lexicon of correctly spelled words, which is used by the spell checker to check spellings as well as to generate suggestions. Various techniques have been used to create the lexicon for different spell checkers; some of them are given below:
• In the Bangla spell checker4, phonetically similar characters are mapped onto a single unit of character code, and the user input is checked using that character code.
• In the Malayalam spell checker5, a "rule cum dictionary" based approach is used: the root words are stored in the dictionary, and user input is checked by deriving its root word using a Morphological Analyzer and Morphological Generator.
• In the Oriya spell checker6, the words in the dictionary are stored according to word length for effective search. Only root words are stored, so the root word is obtained from the user input using a Morphological Analyzer and is then checked against the dictionary.
From the above observations, there are two issues involved in lexicon development:
• Size of the lexicon.
• Format of the words in the lexicon.

Size of the Lexicon
It is observed that two approaches can be followed for storing the lexicon8. The first approach stores only the root words of a language, and the rest of the words are derived from these roots, as in the Oriya spell checker. The other approach is to store all possible words of the language in the lexicon. We have followed the second approach and stored all possible word forms of Shahmukhi in the lexicon.
In the Shahmukhi script, as discussed under Optional Diacritics in the section on error patterns, a single word with two diacritics has four variations, and all the cases have to be considered for spell checking. To cover all the cases, the form of each word carrying all its diacritic marks is stored in the lexicon.
For example, اُلفَت /ulfat/, having two optional diacritics, has four variations, but the lexicon stores the form carrying both diacritic marks, so that spell checking and suggestion generation for a wrong word remain possible even when the spell checker is executing in "with diacritics" mode. For this purpose, all such fully diacritized forms were identified and stored in the database.
In the lexicon, each word is given a phonetic code according to a Soundex approach. The phonetic code itself acts as an index key to all the words having the same phonetic code.
Furthermore, the words are arranged according to the length of the phonetic code. The advantage of this approach is that the phonetic code, the word-length-based dictionary division, and the hash tables of the spell checkers discussed above are all combined, making it more powerful. The number of keys in each of these hash tables is shown in Figure 1.
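The phonetic-code index can be sketched as below. The letter-to-symbol map is illustrative and does not reproduce the paper's actual Table 11:

```python
from collections import defaultdict

# Soundex-style map: phonetically similar letters share one symbol.
# Illustrative subset only, not the paper's Table 11.
PHONETIC_MAP = {"ت": "T", "ط": "T", "س": "S", "ص": "S", "ث": "S",
                "ا": "A", "ک": "K", "ق": "K", "ب": "B", "ر": "R"}

def phonetic_code(word):
    # Map every letter to its symbol; unmapped characters (e.g. diacritics) are dropped.
    return "".join(PHONETIC_MAP.get(ch, "") for ch in word)

def build_index(lexicon):
    # The phonetic code is the hash-table key; the value is the list of
    # all words sharing that code.
    index = defaultdict(list)
    for word in lexicon:
        index[phonetic_code(word)].append(word)
    return index
```

Note how ثابت and سابت, which differ only in phonetically equivalent letters, collapse onto the same key, so one lookup retrieves all spelling alternatives of a sound.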

Format of the Words
In Unicode, there can be more than one code-point sequence for a single letter. For example, the letter آ (alef-madda) can be written with a single key (the single code point U+0622) or with two keys, ا + ٓ (the two combined code points U+0627 + U+0653, corresponding to a single letter). It was necessary to normalize the text stored in the lexicon so that the order of code points in a stored word and in a user-entered word is always the same. Therefore, Normalization Form C (NFC) is used for storing the lexicon.
For example, the word آرام (aaram) is composed as آ + ر + ا + م. If the user writes that word using different keys, like
ا + ٓ + ر + ا + م = آرام
then it is normalized to the single-code-point form so that it can be compared with the lexicon.
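This normalization can be sketched directly with Python's unicodedata module, which implements Unicode NFC:

```python
import unicodedata

def normalize(word):
    # NFC composes decomposed sequences: U+0627 + U+0653 (alef then madda,
    # typed with two keys) becomes the single precomposed letter U+0622.
    return unicodedata.normalize("NFC", word)
```

Both keying variants of آرام therefore normalize to the same code-point sequence and compare equal against the lexicon.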

Spell Checker Architecture
The spell checker follows the architecture proposed by many other researchers[4][5][6][7][8]. The major components of the architecture are shown in Figure 2. The basic modules are: the Pre-processing Module (which consists of Tokenization, Normalization, Removal of Optional Diacritics, and Code Generation), the Lexicon Look-up/Error Detection Module, and the Error Correction/Suggestion Generation Module.

Pre-Processing Module
This module pre-processes the user text so that it can be brought into the predefined format of the lexicon. It performs the following steps:

Tokenization
Tokenization is the process of breaking a block of text into a list of words. The text is broken with the help of boundary delimiters and blank spaces; the boundary delimiters here are the various punctuation marks. As Unicode is standardized for the Shahmukhi script, all boundary delimiters such as punctuation marks and blank spaces are considered during tokenization.
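Such a tokenizer can be sketched as follows. The delimiter set shown (whitespace, Latin punctuation, and the Arabic-script comma U+060C, question mark U+061F, and full stop U+06D4) is an illustrative subset:

```python
import re

# Split on whitespace and punctuation, including the Arabic-script marks
# used when writing Shahmukhi: U+060C comma, U+061F question mark, U+06D4 full stop.
DELIMS = re.compile(r"[\s.,;:!?\u060c\u061f\u06d4]+")

def tokenize(text):
    # Drop the empty strings re.split leaves at delimiter-adjacent boundaries.
    return [tok for tok in DELIMS.split(text) if tok]
```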

Word Normalization
The tokens are then passed through a normalization process to convert them to the format in which the lexicon has been stored; we have used NFC for this purpose. The purpose of normalization can be shown with the following example: the word آُؤَہ can be typed in multiple forms, but typed in any format it will be normalized to the sequence { آ, ُ, ؤ, َ, ہ }.

Remove Diacritics
The normalized tokens are then passed through the Remove Diacritics phase, in which the optional diacritics (shown in Table 5) are removed from the normalized token. The purpose of this phase is to obtain a word of fixed length, which can then be searched in a single dictionary of words of that length. For example,

آُؤَہ → آؤہ (after removing diacritics)

As this word now has a fixed length of three characters, it needs comparison only with the dictionary of words of length three. Similarly, اُلفَت → الفت (after removing diacritics); here الفت has four characters, so it is compared with the dictionary of length four.

Search without Diacritics

In this mode, the stripped token is compared against the lexicon while ignoring optional diacritics. If a match occurs, control passes to the next token. But if a mismatch occurs for some other reason, such as use of a wrong diacritic or a diacritic placed at the wrong position, it is treated as a miss and the Error Correction Module starts.
For example, consider the wrong word آَُؤہ derived from the previous example, where "َ" (zabar) is placed at the wrong position (after "pesh" instead of after "gol he"). When we compare this wrong word with the right word آُؤَہ (from Table 15), a mismatch occurs and the word is passed to the Error Correction Module. Using this approach we can easily handle the Visual Error, the Dual Diacritic Error, and the Diacritic Displacement Error.
Similarly, consider the word آُؤہ, where "َ" (zabar) is simply missing relative to the word in the lexicon. When we compare the two words, this is not considered an error, as the difference lies only in optional diacritics, and control passes to the next token.
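The lenient comparison used in this mode can be sketched as follows; the optional-diacritic set and the function names are illustrative:

```python
# Illustrative optional-diacritic set: zabar, pesh, zer.
OPTIONAL = {"\u064e", "\u064f", "\u0650"}

def strip_optional(word):
    # Used to reach the fixed-length dictionary for the lookup.
    return "".join(ch for ch in word if ch not in OPTIONAL)

def matches_without_compulsion(user, lexicon_word):
    # The user word is correct iff it can be obtained from the fully
    # diacritized lexicon word by dropping optional diacritics: walk the
    # lexicon word, and every skipped character must be an optional mark.
    i = 0
    for ch in lexicon_word:
        if i < len(user) and user[i] == ch:
            i += 1
        elif ch not in OPTIONAL:
            return False  # a base letter (or its diacritic order) mismatches
    return i == len(user)
```

A user word missing a diacritic passes, while a word with a diacritic in the wrong position fails and moves on to error correction, mirroring the two examples above.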

Search with Diacritics
In the case of Search with Diacritics, if the token does not match any word in the list corresponding to its phonetic code, the Error Correction phase starts. For example, consider the word آُؤہ again: here "َ" (zabar) is missing, but in this mode no diacritic is optional, so when we compare the two words the difference in diacritics is treated as an error, and the wrong word is passed to the Error Correction Module.

Advantages of using Phonetic Dictionary in Error Detection Module
Once the system has detected an erroneous word, all the words in the list found earlier using the phonetic code are considered as suggestions for the current user token, since this list contains all the words that have either diacritic or phonetic differences from it. So at this stage, while detecting the error, we already have suggestions for the most commonly occurring errors (i.e. diacritic errors and phonetic errors), which are later fed to the Ranking phase of the Error Correction Module.
For example, in the case of the wrong word آَُؤہ, the list of words sharing its phonetic code (Table 15) is passed directly to the Ranking phase of the Error Correction Module.

Error Correction Module
Once the Error Detection Module has detected an erroneous word, the erroneous word, along with the previous and next words, is passed to the Error Correction Module.
The Error Correction Module performs the following steps:
• Suggestion Generation.
• Ranking of Suggestions.
For suggestion generation we have used an Advanced Reverse Minimum Edit Distance approach using bi-grams to find suggestions for the wrong word.

Reverse Minimum Edit Distance Approach
We have used the reverse minimum edit distance approach to generate the primary suggestion list8. We generate suggestions from the wrong word by assuming errors such as the following:
• Missing Diacritic Error: a diacritic has been dropped, e.g. اُف → اف (uf → af), where ُ is the missing diacritic. Such errors can also give rise to real-word errors, e.g. both اِس and اُس reduce to اس.
• Run-On Error: the space between two or more valid words is missing. For example, اُستت کرنی → اُستتکرنی (ustat karni written as one word).
• Split Word Error: the opposite of a run-on error, where an extra space is inserted between parts of a word; the error is removed by removing the extra space. For example, اُستت → اُس تت (ustat → us tat).
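The reverse minimum edit distance idea, generating every candidate one edit away from the wrong word instead of scanning the whole dictionary, can be sketched as below. The alphabet shown is an illustrative subset, and the space insertion covers the run-on/split-word cases:

```python
# Candidates one edit away: delete, transpose, substitute, insert,
# plus a space insertion to correct run-on errors.
ALPHABET = "ابتثسصطقکلمنور"  # illustrative subset of the Shahmukhi alphabet

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    spaces = [l + " " + r for l, r in splits if l and r]  # run-on correction
    return set(deletes + transposes + substitutes + inserts + spaces)

def suggestions(word, lexicon):
    # Keep only candidates whose every part (after a possible space split)
    # is a valid lexicon word.
    return sorted(e for e in edits1(word) if all(p in lexicon for p in e.split()))
```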

Advantages of using Phonetic Dictionary in Suggestion Generation Phase
We generate suggestions by inserting combinations of errors into the wrong token's phonetic code. Every combination is tried, but the span of combinations reduces to roughly one third because of the phonetic code: each symbol generally stands for three or more characters (as explained in Table 11), so by substituting a single phonetic symbol we check all the phonetic characters corresponding to that symbol in a single match.
For example, consider the word اکبر with phonetic code AKBR. If we insert the phonetic symbol S into AKBR, giving ASKBR, then searching for ASKBR in the hash table is equivalent to searching all forms of the underlying characters (A = 5 characters, S = 5 characters, K = 4 characters, B = 2 characters, R = 3 characters) in a single comparison.

Advanced Reverse Minimum Edit Distance Approach using Bi-gram
We have reduced the number of comparisons of the Reverse Minimum Edit Distance approach by using bi-grams. We pre-computed, for each dictionary (the dictionaries are divided on a length basis), the possibility of each character occurring after another. From this we can find the positions where an error could exist, which minimizes the positions where symbols need to be inserted, substituted, deleted, or transposed. If some bi-gram of the user token cannot occur, only the combinations involving that position are tried in the Reverse Minimum Edit Distance approach and the loop over all other combinations is skipped; whereas if all the bi-grams of the user token exist, every combination of the Reverse Edit Distance approach is tried.
For example, consider a word with phonetic code AKRR. The bi-grams of AKRR are AK, KR, and RR. If we know that the bi-gram 'RR' does not exist in the dictionary of length 4, then one of the two R's must be substituted; no other symbol (i.e. A or K) needs to be substituted, as substituting either A or K would still leave 'RR' in the phonetic code, which cannot occur in any valid word.
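This bi-gram pruning can be sketched as a function that returns only the positions touching an impossible bi-gram; the bi-gram set passed in is illustrative:

```python
def invalid_positions(code, valid_bigrams):
    # Only positions that participate in a bi-gram absent from the
    # dictionary need to be edited; either symbol of a bad bi-gram may
    # be the wrong one, so both its positions are returned.
    bad = set()
    for i in range(len(code) - 1):
        if code[i:i + 2] not in valid_bigrams:
            bad.update((i, i + 1))
    return sorted(bad)
```

For AKRR with 'RR' unattested, only the last two positions are candidates for substitution, so the edit loop skips A and K entirely; if every bi-gram is attested, the function returns no positions and the full combination loop runs instead.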

Ranking of Suggestions
Once the suggestion list has been generated, each suggestion is given a weight according to the results of the error analysis for the Shahmukhi script carried out while identifying the error patterns. According to that analysis, the following types of weights are assigned to the suggestions.
Weightage according to frequency:
• The results are refined according to the frequency of their occurrence. This helps in rearranging the suggestions where a single type of error has occurred.
Weightage according to the location of errors:
• Errors that occur at the end of a word (token) carry more weight than errors at the beginning of the word (token).
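A ranking sketch combining these two weightings might look as follows; the scoring formula and the equal weights are illustrative assumptions, not the paper's exact scheme:

```python
def error_position(wrong, suggestion):
    # Index of the first differing character; later differences mean the
    # error lies nearer the end of the token.
    for i, (a, b) in enumerate(zip(wrong, suggestion)):
        if a != b:
            return i
    return min(len(wrong), len(suggestion))

def rank(wrong, candidates, freq):
    # Higher corpus frequency and an end-of-word difference both raise the
    # score; the simple additive combination is an illustrative assumption.
    def score(cand):
        return freq.get(cand, 0) + error_position(wrong, cand)
    return sorted(candidates, key=score, reverse=True)
```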

Test Words Preparation
We used the most commonly misspelled words to analyze the performance of the spell checker. The words were drawn from several sources:
• Online Shahmukhi newspapers.

Test Results
• Error analysis when diacritics are not compulsory.
• Error analysis when diacritics are compulsory.
• General error analysis.

Conclusion
This is the first time that a spell checker for the Shahmukhi script has been designed and implemented. The spell checker is part of the Shahmukhi word processor. We have only taken care of non-real-word errors; detection and correction of real-word errors and of Izafat is a subject for further research.

Table 10.
Characters having similar phonetic Code

Table 12.
Characters having zero width

In the Assamese spell checker7, a hash table has been used as the lexicon look-up data structure: the correct Assamese words are stored in the hash table, and the user input is checked directly by the dictionary search technique.

Vol 8 (27) | October 2015 | www.indjst.org

• During program execution the words are loaded into hash tables; nineteen different hash tables are used for the different phonetic-code lengths. The key of each hash table is the phonetic code, and its value is the list of all possible words corresponding to that phonetic code.

Table 13.
Some entries in Hash table of length 3

Figure 1. Keys in Hash Table of Different Word Length.

Table 15 is directly passed to the Ranking phase of the Error Correction Module, as all the words in Table 15 have either diacritic or phonetic differences.

Table 17.
Percentage of occurrence of errors in the Optional Diacritic case and in the Compulsory Diacritic case
• Run-On Error and Split Error are the most common errors; they occur due to the presence of non-joiners and to typing errors.
• Substitution Error is the next most common error; it occurs due to substitution of a wrong character.

Table 18.
Percentage of occurrence of errors in the Optional as well as the Compulsory Diacritic case
• Insertion Error and Deletion Error have a lower probability of occurrence than the errors discussed above.
• Transposition Error has the lowest weightage according to the survey, but it is quite helpful in some special cases.