The sequence (w1, w2, …, wn) represents the sequence of characters in a sentence S, and P(wi|w1, …, wi−1) represents the probability of the character wi given the preceding sequence w1, …, wi−1.
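As a minimal illustration, the probability of a sentence can be estimated under a first-order (bigram) Markov assumption, in which each character depends only on the previous one. The Python sketch below uses maximum-likelihood counts; the toy corpus, the "^" start-of-sentence marker, and the function names are illustrative choices, not part of the cited work.

```python
from collections import Counter

def char_bigram_model(corpus):
    """Estimate P(wi | wi-1) by maximum likelihood from character-bigram counts."""
    bigrams = Counter()
    unigrams = Counter()
    for sentence in corpus:
        padded = "^" + sentence  # "^" marks the start of the sentence
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1

    def prob(prev, cur):
        # MLE estimate: count(prev, cur) / count(prev)
        return bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0

    return prob

def sentence_probability(prob, sentence):
    """P(S) = product over i of P(wi | wi-1), the chain rule under the Markov assumption."""
    p = 1.0
    padded = "^" + sentence
    for prev, cur in zip(padded, padded[1:]):
        p *= prob(prev, cur)
    return p
```

In practice such models use smoothing and log probabilities to avoid zero counts and numerical underflow; the unsmoothed form above is kept deliberately small.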
A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naïve) independence assumptions. In text classification, this classifier assigns the most likely category to a given document d from a set of N pre-defined classes c1, c2, …, cN. The classification function f maps a document to a category (f : D → C) by maximizing the following probability [Peng and Schuurmans, 2003]:

f(d) = argmax_c P(c|d) = argmax_c P(d|c) P(c)
where d and c denote the document and the category, respectively. In text classification, a document d can be represented by a vector of T attributes d = (t1, t2, …, tT). Assuming that all attributes ti are independent given the category c, we can calculate P(d|c) with the following equation:

P(d|c) = P(t1|c) · P(t2|c) · … · P(tT|c)
The attribute term ti can be a vocabulary term, local n-gram, word average length, or a global syntactic and semantic property [Peng and Schuurmans, 2003].
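The Naïve Bayes decision rule above can be sketched in a few lines of Python, using character bigrams as the attributes ti and add-one (Laplace) smoothing to avoid zero probabilities. This is only an illustrative sketch of the technique, not the implementation used in the cited experiments; the class and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=2):
    """Extract overlapping character n-grams from a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NaiveBayesDialectClassifier:
    """Picks argmax_c [ log P(c) + sum_i log P(ti|c) ] with add-one smoothing."""

    def fit(self, documents, labels, n=2):
        self.n = n
        self.class_counts = Counter(labels)           # document counts per class
        self.term_counts = defaultdict(Counter)       # character n-gram counts per class
        self.vocab = set()
        for doc, c in zip(documents, labels):
            for t in char_ngrams(doc, n):
                self.term_counts[c][t] += 1
                self.vocab.add(t)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for c in self.class_counts:
            score = math.log(self.class_counts[c] / total_docs)  # log P(c)
            denom = sum(self.term_counts[c].values()) + len(self.vocab)
            for t in char_ngrams(doc, self.n):
                # Add-one-smoothed log P(t|c)
                score += math.log((self.term_counts[c][t] + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best
```

Working in log space turns the product of attribute probabilities into a sum, which avoids numerical underflow on long documents.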
Sadat et al. [2014c] presented a set of experiments using these techniques, with a detailed examination of which models perform best under different conditions in a social media context. Experimental results showed that a Naïve Bayes classifier based on character bigrams can identify the 18 Arabic dialects considered with an overall accuracy of 98%. The dataset used in the experiments was collected manually from forums and blogs for each of the 18 dialects.
To look at the problem in more detail, Sadat et al. [2014a] applied both the n-gram Markov language model and the Naïve Bayes classifier to classify the eighteen Arabic dialects. The results of this study for the n-gram Markov language model are presented in Figure 2.5. This figure shows that the character-based unigram distribution helps identify two dialects, Mauritanian and Moroccan, with an overall F-measure of 60% and an overall accuracy of 96%. Furthermore, the bigram distribution of two-character affixes helps recognize four dialects, Mauritanian, Moroccan, Tunisian, and Qatari, with an overall F-measure of 70% and an overall accuracy of 97%. Lastly, the trigram distribution of three-character affixes helps recognize four dialects, Mauritanian, Tunisian, Qatari, and Kuwaiti, with an overall F-measure of 73% and an overall accuracy of 98%. Overall, for the 18 dialects, the bigram model performed better than the unigram and trigram models.
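A character n-gram Markov language model can be used for dialect identification by training one model per dialect and selecting the dialect whose model assigns the highest (smoothed) likelihood to the input text. The sketch below illustrates this idea under simple assumptions (add-one smoothing, an illustrative alphabet size); it is a minimal sketch of the technique, not the authors' implementation, and all names are hypothetical.

```python
import math
from collections import Counter

def train_char_lm(texts, n=2):
    """Count character n-grams and their (n-1)-character contexts for one dialect."""
    ngrams, contexts = Counter(), Counter()
    for t in texts:
        padded = "^" * (n - 1) + t  # pad so the first characters have a context
        for i in range(len(padded) - n + 1):
            ngrams[padded[i:i + n]] += 1
            contexts[padded[i:i + n - 1]] += 1
    return ngrams, contexts

def log_likelihood(model, text, n=2, alphabet_size=36):
    """Add-one-smoothed log P(text) under a character n-gram Markov model."""
    ngrams, contexts = model
    padded = "^" * (n - 1) + text
    ll = 0.0
    for i in range(len(padded) - n + 1):
        g = padded[i:i + n]
        ll += math.log((ngrams[g] + 1) / (contexts[g[:-1]] + alphabet_size))
    return ll

def identify_dialect(models, text, n=2):
    """Return the dialect whose language model gives the text the highest likelihood."""
    return max(models, key=lambda d: log_likelihood(models[d], text, n))
```

Changing n from 1 to 3 reproduces the unigram/bigram/trigram comparison discussed above, with each dialect's training texts feeding its own model.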
Figure 2.5: Accuracies on the character-based n-gram Markov language models for 18 countries [Sadat et al., 2014a].
Since many dialects are tied to a region, and the Arabic dialects within a region are quite similar, the authors also considered the accuracy per dialect group. Figure 2.6 shows the results of the three character n-gram Markov language models on the six groups/divisions defined in Figure 2.4. Again, the bigram and trigram character Markov language models performed almost the same as in Figure 2.5, although the F-measure of the bigram model was higher than that of the trigram model for all dialect groups except the Egyptian group. Therefore, on average, across all dialects, the character-based bigram language model performed better than the character-based unigram and trigram models.
Figure 2.6: Accuracies on the character-based n-gram Markov language models for the six divisions/groups [Sadat et al., 2014a].
Figure 2.7 shows the results of the n-gram models using Naïve Bayes classifiers for the different countries, while Figure 2.8 shows the results of the n-gram models using Naïve Bayes classifiers for the six divisions according to Figure 2.4. The results show that the Naïve Bayes classifiers based on character unigrams, bigrams, and trigrams outperform the corresponding character-based unigram, bigram, and trigram Markov language models. An overall F-measure of 72% and an accuracy of 97% were obtained for the 18 Arabic dialects. Furthermore, the Naïve Bayes classifier based on a bigram model achieved an overall F-measure of 80% and an accuracy of 98%, except for the Palestinian dialect, because of the small size of its data. The Naïve Bayes classifier based on the trigram model showed an overall F-measure of 78% and an accuracy of 98%, except for the Palestinian and Bahraini dialects. This classifier could not distinguish between the Bahraini and Emirati dialects because of the similarity of their three-character affixes. In addition, the Naïve Bayes classifier based on character bigrams performed better than the classifier based on character trigrams, according to Figure 2.7. Also, as shown in Figure 2.8, for the dialect groups, the Naïve Bayes classifier based on the character bigram model yielded better results than the unigram and trigram models.
Recently, Zaidan and Callison-Burch [2014] created a large monolingual data set rich in dialectal Arabic content, called the Arabic Online Commentary Dataset. They used crowdsourcing to annotate the texts with dialect labels. They also presented experiments on automatic dialect classification for this dataset, using similar word-based and character-based language models. The best results were around 85% accuracy for distinguishing MSA from dialectal data, with lower accuracies for identifying the specific dialect in the latter case. They then applied the classifiers to discover new dialectal data in a large Web crawl of 3.5 million pages mined from online Arabic newspapers.
Figure 2.7: Accuracies on the character-based n-gram Naïve Bayes classifiers for 18 countries [Sadat et al., 2014a].
Several other projects focused on Arabic dialects: classification [Tillmann et al., 2014], code switching [Elfardy and Diab, 2013], and collecting a Twitter corpus for several dialects [Mubarak and Darwish, 2014].
2.9 SUMMARY
This chapter discussed the issue of adapting NLP tools to social media texts. One way is to use text normalization techniques, in order to make the texts closer to the standard, carefully edited texts on which NLP tools are usually trained. The normalization that can be achieved in practice is rather shallow, and it does not seem to help much in improving the performance of the tools. The second way of adapting the tools is to re-train them on annotated social media data. This significantly improves performance, although the amount of annotated data available for re-training is still small. Further development of annotated data sets for social media is needed in order to reach very high levels of performance.
In the next chapter, we will look at advanced methods for various NLP tasks on social media texts. These tasks use some of the tools discussed in this chapter as components.
Figure 2.8: Accuracies on the character-based n-gram Naïve Bayes classifiers for the six divisions/groups [Sadat et al., 2014a].
1https://sites.google.com/site/empirist2015/
2The F-score usually gives the same weight to precision and to recall, but it can weight one of them more when needed for an application.
3http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html
4This data set is available at http://code.google.com/p/ark-tweet-nlp/downloads/list.
5A bracketing