Derczynski et al. [2013b] reported that NER performance drops from 77% F-score on newspaper text to 60% on Twitter data, and that after adaptation it increases to 80% (with the ANNIE NER system from GATE [Cunningham et al., 2002]). The performance on newspaper data was computed on the CoNLL 2003 English NER dataset [Tjong Kim Sang and De Meulder, 2003], while the performance on social media data was computed on part of the Ritter dataset [Ritter et al., 2011], which contains 2,400 tweets comprising 34,000 tokens.
Particular attention is given to microtext normalization, as a way of removing some of the linguistic noise prior to part-of-speech tagging and entity recognition [Derczynski et al., 2013a, Han and Baldwin, 2011]. Some research has focused on named entity recognition algorithms specifically for Twitter messages, training a new CRF model on Twitter data [Ritter et al., 2011].
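To make the idea of normalization concrete, the following is a minimal sketch of a lookup-based normalizer in Python. It is not the method of the cited papers, which is considerably more elaborate; the lexicon entries and the function name normalize_microtext are assumptions made up for this illustration.

```python
import re

# Hypothetical, hand-made lookup table (an assumption for illustration only);
# real normalization lexicons are far larger and are usually learned or curated.
NORMALIZATION_LEXICON = {
    "u": "you",
    "2moro": "tomorrow",
    "gr8": "great",
    "pls": "please",
    "b4": "before",
}


def normalize_microtext(text, lexicon=NORMALIZATION_LEXICON):
    """Replace known non-standard tokens with their standard forms."""

    def replace(match):
        token = match.group(0)
        # Look the token up case-insensitively; unknown tokens are left untouched.
        return lexicon.get(token.lower(), token)

    # \w+ matches runs of letters, digits, and underscores, so "gr8" is one token.
    return re.sub(r"\w+", replace, text)


print(normalize_microtext("c u 2moro, gr8 news"))
```

A lookup table of this kind only covers forms that were seen in advance, which is one reason the literature also explores context-sensitive approaches.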
An NER tool can detect various kinds of named entities, or focus only on one kind. For example, Derczynski and Bontcheva [2014] presented methods for detecting person entities. Chapter 3 will discuss methods for detecting other specific kinds of entities. NER tools can detect entities, disambiguate them (when more than one entity with the same name exists), or resolve co-references (when there are several ways to refer to the same entity).
2.7 EXISTING NLP TOOLKITS FOR ENGLISH AND THEIR ADAPTATION
Many NLP tools have been developed for generic English, and fewer for other languages. We list here several selected tools that have been adapted for social media text. Other tools are available but may not work well on social media texts, although new tools are continually being developed or adapted. We also briefly mention several toolkits that offer a collection of tools, also called suites when the tools can be run as a sequence of consecutive steps, from tokenization to named entity recognition or beyond. Some of them can be re-trained for social media texts.
Stanford CoreNLP is an integrated suite of NLP tools for English programmed in Java, including tokenization, part-of-speech tagging, named entity recognition, parsing, and co-reference resolution. A text classifier is also available.9
OpenNLP includes tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and co-reference resolution, implemented in Java. It also includes maximum entropy and perceptron-based machine learning algorithms.10
FreeLing includes tools for English and several other languages: text tokenization, sentence splitting, morphological analysis, phonetic encoding, named entity recognition, POS tagging, chart-based shallow parsing, rule-based dependency parsing, nominal co-reference resolution, etc.11
NLTK is a suite of text processing libraries in Python for classification, tokenization, stemming, POS tagging, parsing, and semantic reasoning.12
GATE includes components for diverse language processing tasks, e.g., parsers, morphology, POS tagging. It also contains information retrieval tools, information extraction components for various languages, and many others. The information extraction system (ANNIE) includes a named entity detector.13
NLPTools is a library for NLP written in PHP, geared toward text classification, clustering, tokenizing, stemming, etc.14
Some components of these toolkits were re-trained for social media texts, such as the Stanford POS tagger by Derczynski et al. [2013b], and the OpenNLP chunker by Ritter et al. [2011], as we noted earlier.
One toolkit that was fully adapted to social media text is GATE. A new module or plugin called TwitIE15 is available [Derczynski et al., 2013a] for tokenization of Twitter texts, as well as POS tagging, named entity recognition, etc.
Two new toolkits were built especially for social media texts: the TweetNLP tools developed at CMU and the Twitter NLP tools developed at the University of Washington (UW).
TweetNLP is a Java-based tokenizer and part-of-speech tagger for Twitter text [Owoputi et al., 2013]. It includes training data of manually labeled, POS-annotated tweets (noted above), a Web-based annotation tool, and hierarchical word clusters from unlabeled tweets.16 It also includes the TweeboParser mentioned above.
The UW Twitter NLP Tools [Ritter et al., 2011] contain the POS tagger and the annotated Twitter data (mentioned above—see adaptation of POS taggers).17
A few other tools for English are in development, and a few tools for other languages have been adapted or can be adapted to social media text. The development of the latter is slower, due to the difficulty in producing annotated training data for many languages, but there is progress. For example, a treebank for French social media texts was developed by Seddah et al. [2012].
2.8 MULTI-LINGUALITY AND ADAPTATION TO SOCIAL MEDIA TEXTS
Social media messages are available in many languages. Some messages could be mixed, for example part in English and part in another language. This is called “code switching.” If tools for multiple languages are available, a language identification tool needs to be run on the texts before using the right language-specific tools for the next processing steps.
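The identify-then-dispatch step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: route_by_language, toy_identify, and the tiny stop-word lists are hypothetical names and data invented here, and a real system would plug in a trained language identifier and full language-specific pipelines.

```python
# Tiny hand-made stop-word lists (an assumption for illustration only).
STOPWORDS = {
    "en": {"the", "is", "and", "to"},
    "fr": {"le", "la", "est", "et"},
}


def toy_identify(text):
    """Stand-in language identifier: pick the language with most stop-word hits."""
    words = set(text.lower().split())
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def route_by_language(messages, identify, pipelines, fallback=None):
    """Run each message through the pipeline registered for its detected language."""
    results = []
    for message in messages:
        lang = identify(message)
        pipeline = pipelines.get(lang, fallback)
        # None as output signals that no tool is available for this language.
        output = pipeline(message) if pipeline is not None else None
        results.append((lang, message, output))
    return results


# Only an English "pipeline" (plain whitespace tokenization) is registered here.
routed = route_by_language(
    ["the cat is here", "le chat est ici", "12345"],
    toy_identify,
    {"en": str.split},
)
for row in routed:
    print(row)
```

Note that this sketch assigns one language per message, so it does not handle code-switched messages; those call for word-level identification.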
2.8.1 LANGUAGE IDENTIFICATION
Language identification can reach very high accuracy for long texts (98–99%), but it needs adaptation to social media texts, especially to short texts such as Twitter messages.
Derczynski et al. [2013b] showed that language identification accuracy decreases to around 90% on Twitter data, and that re-training can lead to 95–97% accuracy levels. This increase is easily achievable for tools that classify into a small number of languages, while tools that classify into a large number of languages (close to 100 languages) cannot be further improved on short informal texts. Lui and Baldwin [2014] tested six language identification tools and obtained the best results on Twitter data by majority voting over three of them, up to an F-score of 0.89.
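Majority voting over three identifiers can be sketched as below. The exact voting scheme and tie handling used by Lui and Baldwin [2014] are not given here, so the rule shown (return a label only if at least two systems agree) is an assumption, and the three identifiers are hypothetical stand-ins rather than real tools.

```python
from collections import Counter


def majority_vote(text, identifiers, min_votes=2):
    """Return the label chosen by at least min_votes identifiers, else None."""
    votes = Counter(identify(text) for identify in identifiers)
    label, count = votes.most_common(1)[0]
    return label if count >= min_votes else None


# Hypothetical stand-ins for three off-the-shelf language identifiers.
def system_a(text):
    return "en"


def system_b(text):
    return "en"


def system_c(text):
    return "de"


print(majority_vote("gr8 game tonight", [system_a, system_b, system_c]))
```

With three voters and min_votes=2, the function abstains (returns None) exactly when all three systems disagree, which lets a caller fall back to a default or a single trusted system.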
Barman et al. [2014] presented a new dataset containing Facebook posts and comments that exhibit code mixing between Bengali, English, and Hindi. The researchers demonstrated some preliminary word-level language identification experiments using this dataset. The methods surveyed included a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labeling using Conditional Random Fields. The preliminary results demonstrated the superiority of supervised classification and sequence labeling over dictionary-based classification, suggesting that contextual clues are necessary for accurate classifiers. The CRF model achieved the best result with an F-score of 0.95.
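As an illustration of the simplest of these approaches, here is a minimal Python sketch of dictionary-based word-level language identification. It is not the implementation of Barman et al. [2014]; the word lists (a handful of English words and romanized Hindi and Bengali words), the tag names, and the function tag_words are assumptions chosen for this example.

```python
# Tiny hand-made word lists (an assumption for illustration only).
WORD_LISTS = {
    "en": {"i", "you", "is", "good", "what", "and", "movie"},
    "hi": {"hai", "kya", "aur", "nahi", "accha"},
    "bn": {"ami", "tumi", "bhalo", "khub", "dekhlam"},
}


def tag_words(text, word_lists=WORD_LISTS):
    """Tag each word with a language if exactly one dictionary contains it."""
    tags = []
    for word in text.lower().split():
        matches = [lang for lang, words in word_lists.items() if word in words]
        # Words found in no dictionary, or in several, cannot be resolved
        # without context, so they are tagged "unk".
        tag = matches[0] if len(matches) == 1 else "unk"
        tags.append((word, tag))
    return tags


print(tag_words("Ami good movie dekhlam yesterday"))
```

The "unk" cases show where a dictionary alone falls short: out-of-vocabulary and ambiguous words need the surrounding words to be resolved, which is what the supervised and sequence labeling models exploit.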
There is a large body of work on language identification in social media. Twitter has been a favorite target, and a number of papers deal specifically with language identification of Twitter messages [Bergsma et al., 2012, Carter et al., 2013, Goldszmidt et al., 2013, Mayer, 2012, Tromp and Pechenizkiy, 2011]. Tromp and Pechenizkiy [2011] proposed a graph-based n-gram approach that works well on tweets. Lui and Baldwin [2014] looked specifically at the problem of adapting existing language identification tools to Twitter messages, including challenges in obtaining data for evaluation, as well as the effectiveness of proposed strategies. They tested several tools on Twitter data (including a newly collected corpus for English, Japanese, and Chinese). The tests were done with off-the-shelf tools, before and after a simple cleaning of the Twitter data, such as removing hashtags, mentions, emoticons, etc. The improvement after the cleaning was small. Bergsma et al. [2012] looked at less common languages, in order to collect language-specific corpora. The nine languages they focused on (Arabic, Farsi, Urdu, Hindi, Nepali, Marathi, Russian, Bulgarian, Ukrainian) use three different non-Latin scripts: Arabic, Devanagari, and Cyrillic. Their method for language identification was based on language models.
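The kind of simple cleaning mentioned above (removing hashtags, mentions, and emoticons) can be sketched with regular expressions in Python. The patterns below are assumptions made for illustration, not the rules used in the cited study; in particular, the emoticon pattern covers only a few common Western-style emoticons.

```python
import re

# Illustrative patterns (assumptions, deliberately simple).
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Eyes, optional nose, mouth, e.g. :) ;-( =D, not glued to a word character.
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][-']?[)(DPp](?!\w)")


def clean_tweet(text):
    """Remove mentions, hashtags, and simple emoticons, then tidy whitespace."""
    for pattern in (MENTION_RE, HASHTAG_RE, EMOTICON_RE):
        text = pattern.sub(" ", text)
    # Collapse the gaps left behind by the removed items.
    return " ".join(text.split())


print(clean_tweet("@anna loving the new album :) #music"))
```

Removing a hashtag entirely also discards the word inside it; whether to drop the whole tag or only the "#" sign is a design choice that this sketch does not explore.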
Most of the methods used only the text of the message, but Carter et al. [2013] also looked at the use of metadata, an approach which is unique to social media. They identified five microblog characteristics that can help in language identification: the language profile of the blogger, the content of an attached hyperlink, the language profile of other