Arabic Dialects (AD), the language of daily use, differ from MSA, especially in social media communication. However, most Arabic social media texts mix forms and show many variations, especially between MSA and AD. Figure 2.3 illustrates the AD distribution.
Arabic dialects can be divided into six regional groups: Egyptian, Levantine, Gulf, Iraqi, Maghrebi, and others, as shown in Figure 2.4.
Dialect identification is closely related to the language identification problem. The dialect identification task attempts to identify the spoken dialect from within a set of texts that use the same character set in a known language.
Due to the similarity of dialects within a language, dialect identification is more difficult than language identification. Machine learning approaches and language models which are used for language identification need to be adapted for dialect identification as well.
Several projects on NLP for MSA have been carried out, but research on Dialectal Arabic NLP is in early stages [Habash, 2010].
When processing Arabic for the purposes of social media analysis, the first step is to identify the dialect and then map the dialect to MSA, because there is a lack of resources and tools for Dialectal Arabic NLP. We can therefore use MSA tools and resources after mapping the dialect to MSA.
Figure 2.3: Arabic dialects distribution and variation across Asia and Africa [Sadat et al., 2014a].
Figure 2.4: Division of Arabic dialects in six groups/divisions [Sadat et al., 2014a].
Diab et al. [2010] have run the COLABA project, a major effort to create resources and processing tools for Dialectal Arabic blogs. They used the BAMA and MAGEAD morphological analyzers. This project focused on four dialects: Egyptian, Iraqi, Levantine, and Moroccan.
Several tools for MSA text processing, namely BAMA, MAGEAD, and MADA, will now be described briefly.
BAMA (Buckwalter Arabic Morphological Analyzer) provides morphological annotation for MSA. The BAMA database contains three tables of Arabic stems, complex prefixes, and complex suffixes and three additional tables used for controlling prefix-stem, stem-suffix, and prefix-suffix combinations [Buckwalter, 2004].
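The table-driven design described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not BAMA's actual code or data: the six tables hold a handful of toy entries in Buckwalter transliteration, and the category names (`PrefDet`, `SuffPron`, and so on) and the function `analyze` are hypothetical placeholders.

```python
# Three toy lexicon tables: surface form -> morphological category.
# Entries and category names are illustrative, not taken from BAMA.
PREFIXES = {"": "PrefNone", "w": "PrefConj", "Al": "PrefDet"}
STEMS = {"ktAb": "Noun", "ktb": "Verb"}
SUFFIXES = {"": "SuffNone", "h": "SuffPron"}

# Three toy compatibility tables: allowed category pairs.
PREFIX_STEM = {("PrefNone", "Noun"), ("PrefConj", "Noun"), ("PrefDet", "Noun"),
               ("PrefNone", "Verb"), ("PrefConj", "Verb")}
STEM_SUFFIX = {("Noun", "SuffNone"), ("Noun", "SuffPron"),
               ("Verb", "SuffNone"), ("Verb", "SuffPron")}
PREFIX_SUFFIX = {("PrefNone", "SuffNone"), ("PrefNone", "SuffPron"),
                 ("PrefConj", "SuffNone"), ("PrefConj", "SuffPron"),
                 ("PrefDet", "SuffNone")}


def analyze(word):
    """Return every (prefix, stem, suffix) split licensed by all six tables."""
    analyses = []
    for i in range(len(word) + 1):
        for j in range(i + 1, len(word) + 1):  # the stem must be non-empty
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix not in PREFIXES or stem not in STEMS or suffix not in SUFFIXES:
                continue
            p, s, x = PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
            # A split is valid only if all three category pairs are compatible.
            if (p, s) in PREFIX_STEM and (s, x) in STEM_SUFFIX and (p, x) in PREFIX_SUFFIX:
                analyses.append((prefix, stem, suffix))
    return analyses


print(analyze("wktAbh"))   # [('w', 'ktAb', 'h')]
print(analyze("AlktAbh"))  # [] because the toy tables reject PrefDet with SuffPron
```

In this sketch the lexicon tables decide which segments exist, while the compatibility tables filter out splits whose parts cannot co-occur.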
MAGEAD is a morphological analyzer and generator for the Arabic language family, including MSA and the spoken dialects of Arabic. MAGEAD has been adapted to analyze the Levantine dialect [Habash and Rambow, 2006].
MADA+TOKAN is a toolkit for morphological analysis and disambiguation for the Arabic language that includes Arabic tokenization, diacritization, disambiguation, POS tagging, stemming, and lemmatization. MADA selects the best analysis among all possible analyses for each word in the current context, using SVM classifiers for 19 weighted morphological features. The selected analyses carry complete diacritic, lexemic, glossary, and morphological information. TOKAN takes the information provided by MADA to generate tokenized output in a wide variety of customizable formats. MADA depends on three resources: BAMA, the SRILM toolkit, and SVMTools [Habash et al., 2009].
Going back to the problem of AD identification, we give here a detailed example, with results. Sadat et al. [2014c] provided a framework for AD classification using probabilistic models across social media datasets. They incorporated two popular techniques for language identification: the character n-gram Markov language model and Naïve Bayes classifiers.
The Markov model calculates the probability that an input text is derived from a given language model built from training data [Dunning, 1994]. This model computes the probability P(S), or likelihood, of a sentence S by using the chain rule:
$$P(S) = P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$
The sequence (w1, w2, …, wn) represents the sequence of characters in a sentence S. P (wi | w1, …, wi–1) represents the probability of the character wi given the sequence w1, …wi–1.
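The chain formula can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the implementation used by Sadat et al.: it truncates the history to one character (a bigram model), uses add-one smoothing, and trains on invented toy strings. The names `CharBigramModel` and `identify` are hypothetical.

```python
import math
from collections import Counter


class CharBigramModel:
    """Character bigram language model with add-one smoothing (an assumption)."""

    def __init__(self, texts):
        self.bigrams = Counter()   # counts of (previous char, current char)
        self.contexts = Counter()  # counts of each previous char
        self.vocab = set()
        for text in texts:
            self.vocab.update(text)
            padded = "^" + text  # "^" marks the start of a sentence
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, text):
        """Return log P(S), approximating each history by its last character."""
        padded = "^" + text
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            total += math.log((self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v))
        return total


def identify(text, models):
    """Pick the label whose model assigns the highest log-likelihood."""
    return max(models, key=lambda label: models[label].log_prob(text))


# Toy training data, invented for illustration only.
models = {
    "A": CharBigramModel(["abab abab", "ab ab ab"]),
    "B": CharBigramModel(["cdcd cdcd", "cd cd cd"]),
}
print(identify("abab", models))  # A
```

Working in log space avoids numerical underflow when many small probabilities are multiplied; one model is trained per dialect and the input is assigned to the model with the highest score.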
A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naïve) independence assumptions. In text classification, this classifier assigns the most likely category or class to a given document d from a set of N predefined classes c1, c2, …, cN. The classification function f maps a document to a category (f : D → C) by maximizing the probability in the following equation [Peng and Schuurmans, 2003]:
$$c^{*} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} P(c)\, P(d \mid c)$$
where d and c denote the document and the category, respectively. In text classification a document d can be represented by a vector of T attributes d = (t1, t2, …, tT). Assuming that all attributes ti are independent given the category c, we can calculate P(d|c) with the following equation:
$$P(d \mid c) = \prod_{i=1}^{T} P(t_i \mid c)$$
Figure 2.5: Accuracies on the character-based n-gram Markov language models for 18 countries [Sadat et al., 2014a].
The attribute term ti can be a vocabulary term, local n-gram, word average length, or a global syntactic and semantic property [Peng and Schuurmans, 2003].
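As a rough sketch of the Naïve Bayes decision rule described above, the code below uses character n-grams as the attributes ti. It is not the system of Sadat et al.: add-one smoothing, the toy training strings, and the names `char_ngrams` and `NaiveBayesNgram` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Split a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesNgram:
    """Multinomial Naive Bayes over character n-grams, add-one smoothed."""

    def __init__(self, n=2):
        self.n = n
        self.class_docs = Counter()                 # documents seen per class
        self.feature_counts = defaultdict(Counter)  # n-gram counts per class
        self.vocab = set()

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            grams = char_ngrams(doc, self.n)
            self.class_docs[label] += 1
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def log_posterior(self, doc, label):
        """Return log P(c) + sum_i log P(t_i | c), up to the constant log P(d)."""
        score = math.log(self.class_docs[label] / sum(self.class_docs.values()))
        counts = self.feature_counts[label]
        denom = sum(counts.values()) + len(self.vocab) + 1  # +1 for unseen n-grams
        for gram in char_ngrams(doc, self.n):
            score += math.log((counts[gram] + 1) / denom)
        return score

    def predict(self, doc):
        return max(self.class_docs, key=lambda c: self.log_posterior(doc, c))


# Toy data, invented for illustration only.
clf = NaiveBayesNgram(n=2).fit(
    ["abab abab", "ab ab", "cdcd cdcd", "cd cd"],
    ["A", "A", "B", "B"],
)
print(clf.predict("abab"))  # A
```

The denominator P(d) is the same for every class, so it can be dropped when only the arg max is needed.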
Sadat et al. [2014c] presented a set of experiments using these techniques with detailed examination of what models perform best under different conditions in a social media context. Experimental results showed that the Naïve Bayes classifier based on character bigrams can identify the 18 different Arabic dialects considered with an overall accuracy of 98%. The dataset used in the experiments was manually collected from forums and blogs, for each of the 18 dialects.
To look at the problem in more detail, Sadat et al. [2014a] applied both the n-gram Markov language model and the Naïve Bayes classifier to classify the 18 Arabic dialects. The results of this study for the n-gram Markov language model are shown in Figure 2.5. The figure shows that the character-based unigram distribution helps identify two dialects, Mauritanian and Moroccan, with an overall F-measure of 60% and an overall accuracy of 96%. Furthermore, the bigram distribution (two-character affixes) helps recognize four dialects, Mauritanian, Moroccan, Tunisian, and Qatari, with an overall F-measure of 70% and an overall accuracy of 97%. Lastly, the trigram distribution (three-character affixes) helps recognize four dialects, Mauritanian, Tunisian, Qatari, and Kuwaiti, with an overall F-measure of 73% and an overall accuracy of 98%. Overall, for 18 dialects, the bigram model performed better than other models (unigram and trigram models).
Figure