Taking a broader view of social media, Nguyen and Dogruöz [2013] looked at language identification in a mixed Dutch-Turkish Web forum. Mayer [2012] considered language identification of private messages between eBay users.
Here are some of the available tools for language identification.
• langid.py18 [Lui and Baldwin, 2012] works for 97 languages and uses a feature set selected from multiple sources, combined via a multinomial Naïve Bayes classifier.
• CLD2,19 the language identifier embedded in the Chrome Web browser,20 uses a Naïve Bayes classifier and script-specific tokenization strategies.
• LangDetect21 is a Naïve Bayes classifier, using a representation based on character n-grams without feature selection, with a set of normalization heuristics.
• whatlang [Brown, 2013] uses a vector-space model with per-feature weighting over character n-grams.
• YALI22 computes a per-language score using the relative frequency of a set of byte n-grams selected by term frequency.
• TextCat23 is an implementation of the method of Cavnar and Trenkle [1994]; it uses an ad hoc rank-order statistic over character n-grams.
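As a rough illustration of the rank-order approach underlying TextCat (a minimal sketch, not the tool's actual implementation), each language is represented by its most frequent character n-grams in rank order, and a document is assigned to the language whose profile minimizes the "out-of-place" distance:

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Build a rank-ordered character n-gram profile (most frequent first)."""
    counts = Counter()
    for n in range(1, n_max + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in counts.most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    """Cavnar-Trenkle out-of-place distance between two rank profiles."""
    max_penalty = len(lang_profile)  # penalty for n-grams absent from the profile
    rank = {g: i for i, g in enumerate(lang_profile)}
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))

def identify(text, lang_profiles):
    """Return the language whose profile is closest to the document's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

In practice the profiles would be trained on large monolingual corpora; the cutoff `top_k` and maximum n-gram length are tunable parameters.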
Only some of the available tools were trained directly on social media data.
• LDIG24 is an off-the-shelf Java language identification tool targeted specifically at Twitter messages. It has pre-trained models for 47 languages. It uses a document representation based on data structures named tries.25
• MSR-LID [Goldszmidt et al., 2013] is based on rank-order statistics over character n-grams, and Spearman’s coefficient to measure correlations. Twitter-specific training data was acquired through a bootstrapping approach.
Some datasets of social media texts annotated with language labels are available.
• The dataset of Tromp and Pechenizkiy [2011] contains 9,066 Twitter messages, each labeled with one of six languages: German, English, Spanish, French, Italian, and Dutch.26
• The Twituser language identification dataset27 of Lui and Baldwin [2014] covers English, Japanese, and Chinese.
2.8.2 DIALECT IDENTIFICATION
Sometimes it is not enough that a language has been identified correctly. A case in point is Arabic. It is the official language in 22 countries, spoken by more than 350 million people worldwide.28 Modern Standard Arabic (MSA) is the written form of Arabic used in education; it is also the language of formal communication. Arabic dialects, or colloquial varieties, are the forms of Arabic spoken daily by Arab people. There are more than 22 dialects; some countries share the same dialect, while several dialects may coexist with MSA within the same Arab country. Arabic speakers prefer to use their own local dialect. Recently, more attention has been paid to the Arabic dialects and to the written varieties of Arabic found on social media (chats, micro-blogs, blogs, and forums), which are the target of research on sentiment analysis and opinion extraction.
Huang [2015] presented an approach to improving Arabic dialect classification with semi-supervised learning, combining weakly supervised, strongly supervised, and unsupervised classifiers. These combinations yielded significant and consistent improvements on two test sets: dialect classification accuracy improved by 5% over the strongly supervised classifier and by 20% over the weakly supervised classifier. Furthermore, when the improved dialect classifier was applied to build an MSA language model (LM), the new model was 70% smaller, while English-Arabic translation quality improved by 0.6 BLEU points.
Arabic Dialects (AD), the varieties of Arabic used in daily life, differ from MSA, especially in social media communication. Most Arabic social media texts mix forms and show many variations, especially between MSA and AD. Figure 2.3 illustrates the AD distribution.
Figure 2.3: Arabic dialects distribution and variation across Asia and Africa [Sadat et al., 2014a].
The regional varieties can be divided into six groups: Egyptian, Levantine, Gulf, Iraqi, Maghrebi, and others, as shown in Figure 2.4.
Dialect identification is closely related to the language identification problem. The dialect identification task attempts to identify the spoken dialect from within a set of texts that use the same character set in a known language.
Due to the similarity of dialects within a language, dialect identification is more difficult than language identification. Machine learning approaches and language models which are used for language identification need to be adapted for dialect identification as well.
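As a minimal illustration of such an adaptation, a character n-gram multinomial Naïve Bayes classifier can be retrained with dialect labels instead of language labels. The sketch below uses add-one smoothing; the dialect labels and Arabizi-style training strings in the usage example are toy illustrations, not real training data:

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Extract overlapping character n-grams (bigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NaiveBayesDialectID:
    """Multinomial Naive Bayes over character bigrams with add-one smoothing."""

    def train(self, labeled_texts):
        self.counts = {}          # dialect -> Counter of n-gram frequencies
        self.priors = Counter()   # dialect -> number of training texts
        vocab = set()
        for dialect, text in labeled_texts:
            self.priors[dialect] += 1
            c = self.counts.setdefault(dialect, Counter())
            c.update(char_ngrams(text))
            vocab.update(c)
        self.vocab_size = len(vocab)
        total = sum(self.priors.values())
        self.log_prior = {d: math.log(k / total) for d, k in self.priors.items()}

    def classify(self, text):
        def score(dialect):
            c = self.counts[dialect]
            denom = sum(c.values()) + self.vocab_size
            return self.log_prior[dialect] + sum(
                math.log((c[g] + 1) / denom) for g in char_ngrams(text))
        return max(self.counts, key=score)
```

Because dialects of one language share most of their n-gram inventory, real systems rely on much larger training corpora and on the few n-grams that are genuinely discriminative between dialects.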
Several projects on NLP for MSA have been carried out, but research on Dialectal Arabic NLP is in early stages [Habash, 2010].
When processing Arabic for the purposes of social media analysis, the first step is to identify the dialect and then map it to MSA, because resources and tools for Dialectal Arabic NLP are scarce; after this mapping, the existing MSA tools and resources can be applied.
Figure 2.4: Division of Arabic dialects in six groups/divisions [Sadat et al., 2014a].
Diab et al. [2010] ran the COLABA project, a major effort to create resources and processing tools for Dialectal Arabic blogs. They used the BAMA and MAGEAD morphological analyzers. The project focused on four dialects: Egyptian, Iraqi, Levantine, and Moroccan.
Several MSA text-processing tools, namely BAMA, MAGEAD, and MADA, will now be described briefly.
BAMA (Buckwalter Arabic Morphological Analyzer) provides morphological annotation for MSA. The BAMA database contains three tables of Arabic stems, complex prefixes, and complex suffixes and three additional tables used for controlling prefix-stem, stem-suffix, and prefix-suffix combinations [Buckwalter, 2004].
MAGEAD is a morphological analyzer and generator for Arabic, covering MSA and the spoken dialects. It has been modified to analyze the Levantine dialect [Habash and Rambow, 2006].
MADA+TOKAN is a toolkit for morphological analysis and disambiguation of Arabic that includes tokenization, diacritization, morphological disambiguation, POS tagging, stemming, and lemmatization. MADA selects the best analysis among all possible analyses for each word in context, using SVM models that classify 19 weighted morphological features. The selected analyses carry complete diacritic, lexemic, gloss, and morphological information. TOKAN takes the information provided by MADA and generates tokenized output in a wide variety of customizable formats. MADA depends on three resources: BAMA, the SRILM toolkit, and SVMTools [Habash et al., 2009].
Going back to the problem of AD identification, we give here a detailed example, with results. Sadat et al. [2014c] provided a framework for AD classification using probabilistic models across social media datasets. They incorporated two popular techniques for language identification: the character n-gram Markov language model and Naïve Bayes classifiers.29
The Markov model calculates the probability that an input text was generated by a given language model built from training data [Dunning, 1994]. This model computes the probability P(S), or likelihood, of a sentence S, by using the following chain formula:
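Assuming the standard character n-gram chain rule, P(S) = ∏ᵢ P(cᵢ | cᵢ₋ₙ₊₁ … cᵢ₋₁), a minimal add-one-smoothed sketch of such a Markov language model (an illustration, not Sadat et al.'s actual implementation) can be written as:

```python
import math
from collections import Counter

def train_markov(text, n=3):
    """Count character n-grams and their (n-1)-gram histories in training text."""
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    hists = Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))
    return ngrams, hists, len(set(text))  # charset size for add-one smoothing

def log_prob(s, model, n=3):
    """Log-likelihood of s under the chain rule with an (n-1)-order Markov assumption."""
    ngrams, hists, v = model
    lp = 0.0
    for i in range(n - 1, len(s)):
        g = s[i - n + 1:i + 1]          # current n-gram
        h = g[:-1]                       # its (n-1)-gram history
        lp += math.log((ngrams[g] + 1) / (hists[h] + v))
    return lp

def identify_dialect(s, models):
    """Pick the dialect whose language model assigns s the highest likelihood."""
    return max(models, key=lambda d: log_prob(s, models[d]))
```

Each candidate dialect gets its own model trained on dialect-specific text, and the input is assigned to the model with the highest log-likelihood.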