Methods for NER
NER is composed of two sub-tasks: detecting entities (the span of text where a name starts and where it ends) and determining/classifying the type of entity. The methods used in NER are based either on linguistic grammars for each type of entity or on statistical methods. Semi-supervised learning techniques have been proposed, but supervised learning, especially based on CRFs for sequence learning, is the most prevalent. Hand-crafted grammar-based systems typically obtain good precision, but at the cost of lower recall and months of work by experienced computational linguists. Supervised learning techniques have been used more recently due to the availability of annotated training datasets, mostly for newspaper texts, such as data from MUC 6, MUC 7, and ACE,12 and also the CoNLL 2003 English NER dataset [Tjong Kim Sang and De Meulder, 2003].
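In practice, the two sub-tasks are usually solved jointly by tagging each token with a BIO label (B-: beginning of an entity, I-: inside, O: outside) and decoding the tag sequence back into typed spans. The sketch below is illustrative only (the function name, tag set, and example sentence are ours, not from any cited system):

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into (entity_type, start, end) spans.

    Each span covers tokens[start:end]; detection (the boundaries)
    and classification (the entity type) are recovered together.
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # Close the current span on O, on a new B-, or on a type change.
        if tag == "O" or tag.startswith("B-") or (
                tag.startswith("I-") and etype != tag[2:]):
            if start is not None:
                spans.append((etype, start, i))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]  # tolerate I- without a preceding B-
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans

tokens = ["Barack", "Obama", "visited", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
# [('PER', 0, 2), ('LOC', 3, 4)]
```

The decoding step makes explicit why exact span boundaries matter: a single wrong tag shifts or splits the recovered entity.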
Tkachenko et al. [2013] described a supervised learning method for named-entity recognition. Feature engineering and learning algorithm selection are critical factors when designing an NER system. Possible features could include word lemmas, part-of-speech tags, and occurrence in some dictionary that encodes characteristic attributes of words relevant for the classification task. Tkachenko et al. [2013] included morphological, dictionary-based, WordNet-based, and global features. For their learning algorithm, the researchers chose CRFs, due to their sequential nature and ability to handle a large number of features. As also mentioned above, CRFs are widely used for the task of NER. For Estonian, they produced a gold-standard NER corpus, on which their CRF-based model achieved an overall F-score of 0.87.
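Feature engineering of this kind is typically expressed as a per-token feature dictionary handed to the CRF toolkit. The following sketch shows the flavor of such features under our own assumptions (the toy gazetteer and feature names are illustrative; they are not Tkachenko et al.'s actual feature set):

```python
# Toy dictionary feature; real systems use large gazetteers.
PERSON_GAZETTEER = {"obama", "merkel"}

def token_features(tokens, pos_tags, i):
    """Illustrative per-token feature dict of the kind fed to a CRF.

    Real systems add morphological, WordNet-based, and global
    features on top of surface features like these.
    """
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "pos": pos_tags[i],
        "in_gazetteer": word.lower() in PERSON_GAZETTEER,
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
    }
    # Context features give the CRF its sequential signal.
    feats["prev.word.lower"] = tokens[i - 1].lower() if i > 0 else "<s>"
    feats["next.word.lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"
    return feats

tokens = ["President", "Obama", "spoke"]
pos = ["NNP", "NNP", "VBD"]
f = token_features(tokens, pos, 1)
print(f["in_gazetteer"], f["prev.word.lower"])
# True president
```

Because a CRF scores whole label sequences, the context features let evidence from neighboring tokens (e.g., a preceding title word) influence the label of the current token.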
He and Sun [2017] developed a semi-supervised learning model based on deep neural networks (Bi-LSTM). This system combined transition probabilities with deep learning to train the model directly on F-score and label accuracy. The researchers used a modified, labeled corpus that corrected labeling errors in data developed by Peng and Dredze [2016] for NER in Chinese social media. They evaluated their model on NER and nominal mention tasks. On the dataset of Peng and Dredze [2016], their Bi-LSTM model achieved state-of-the-art results for NER in Chinese social media, with an F-score of 0.53.
Approaches based on deep learning were shown to benefit NER systems as well. Aguilar et al. [2017] proposed a multi-task approach by employing the task of Named Entity (NE) segmentation together with the task of fine-grained NE categorization. The multi-task neural network architecture learns higher-order feature representations from word and character sequences along with basic part-of-speech tags and gazetteer information. This neural network acts as a feature extractor to feed a Conditional Random Fields classifier. They obtained the best results in the 3rd Workshop on Noisy User-generated Text (WNUT-2017) with a 0.4186 entity detection F-score. Aguilar et al. [2018] extended the system’s architecture and improved the results by 2–3%.
Evaluation Measures for NER
The precision, recall, and F-measure can be calculated at sequence level (whole span of text) or at token level. The former is stricter because each named entity that is longer than one word has to have an exact start and end point. Once entities have been determined, the accuracy of assigning them to tags such as Person, Organization, etc., can be calculated.
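The difference between the two levels of evaluation can be made concrete with a small sketch (a minimal illustration in our own notation, assuming exact-match scoring at the sequence level):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def span_f1(gold_spans, pred_spans):
    """Sequence level: a prediction counts only on an exact
    (type, start, end) match with a gold entity."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    return prf(tp, len(pred) - tp, len(gold) - tp)

def token_f1(gold_tags, pred_tags):
    """Token level: each non-O token is scored independently."""
    pairs = list(zip(gold_tags, pred_tags))
    tp = sum(1 for g, p in pairs if g == p != "O")
    fp = sum(1 for g, p in pairs if p != "O" and g != p)
    fn = sum(1 for g, p in pairs if g != "O" and g != p)
    return prf(tp, fp, fn)

# Gold: "New York" as one LOC span; the prediction misses the second token.
print(span_f1([("LOC", 0, 2)], [("LOC", 0, 1)]))
# (0.0, 0.0, 0.0)  -- strict: no exact boundary match
print(token_f1(["B-LOC", "I-LOC"], ["B-LOC", "O"]))
```

The same partially correct prediction scores zero at the sequence level but earns partial credit at the token level, which is why sequence-level figures are systematically lower.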
Adaptation for Named Entity Recognition
Named entity recognition methods typically have 85–90% accuracy on long and carefully edited texts, but their performance decreases to 30–50% on tweets [Li et al., 2012a, Liu et al., 2012b, Ritter et al., 2011].
Ritter et al. [2011] reported that the Stanford NER obtains 44% accuracy on Twitter data. They also presented new NER methods for social media texts based on labeled Latent Dirichlet Allocation (LDA)13 [Ramage et al., 2009], that allowed their T-Seg NER system to reach an accuracy of 67%.
Derczynski et al. [2013b] reported that NER performance drops from 77% F-score on newspaper text to 60% on Twitter data, and that after adaptation it increases to 80% (with the ANNIE NER system from GATE) [Cunningham et al., 2002]. The performance on newspaper data was computed on the CoNLL 2003 English NER dataset [Tjong Kim Sang and De Meulder, 2003], while the performance on social media data was computed on part of the Ritter dataset [Ritter et al., 2011], which contains 2,400 tweets comprising 34,000 tokens.
Particular attention is given to microtext normalization, as a way of removing some of the linguistic noise prior to part-of-speech tagging and entity recognition [Derczynski et al., 2013a, Han and Baldwin, 2011]. Some research has focused on named entity recognition algorithms specifically for Twitter messages, training a new CRF model on Twitter data [Ritter et al., 2011].
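In its simplest form, such normalization is a lexicon lookup that maps noisy tokens to their standard forms. The sketch below uses a hard-coded toy lexicon for illustration; actual systems such as Han and Baldwin [2011] learn these mappings from data and combine them with spelling correction:

```python
# Toy normalization lexicon (illustrative entries, not a real resource).
NORM_LEXICON = {"u": "you", "2moro": "tomorrow", "gr8": "great", "nyc": "NYC"}

def normalize(tokens):
    """Replace out-of-vocabulary tokens using a normalization lexicon.

    A pre-processing pass like this reduces noise before POS tagging
    and entity recognition; casing restoration and learned spelling
    correction would be further steps in a full pipeline.
    """
    return [NORM_LEXICON.get(t.lower(), t) for t in tokens]

print(normalize(["c", "u", "2moro", "in", "nyc"]))
# ['c', 'you', 'tomorrow', 'in', 'NYC']
```

Tokens absent from the lexicon pass through unchanged, so a downstream tagger trained on edited text sees input closer to its training distribution.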
An NER tool can detect various kinds of named entities, or focus only on one kind. For example, Derczynski and Bontcheva [2014] presented methods for detecting person entities. Chapter 3 will discuss methods for detecting other specific kinds of entities. The NER tools can detect entities, disambiguate them (when more than one entity with the same name exists), or solve co-references (when there are several ways to refer to the same entity).
2.7 EXISTING NLP TOOLKITS FOR ENGLISH AND THEIR ADAPTATION
There are many NLP tools developed for generic English and fewer for other languages. We list here several selected tools that have been adapted for social media text. Other tools exist but may not yet be useful for social media texts, although new tools are continually being developed or adapted. Nonetheless, we briefly mention several toolkits that offer a collection of tools, also called suites when the tools can be used in a sequence of consecutive steps, from tokenization to named entity recognition or beyond. Some of them can be re-trained for social media texts.
SpaCy is a suite of NLP tools based on deep learning technology. It includes general-purpose pretrained models to predict named entities, part-of-speech tags, and syntactic dependencies, and it can be used out-of-the-box and fine-tuned on more specific data.14
The Stanford CoreNLP is an integrated suite of NLP tools for English programmed in Java, including tokenization, part-of-speech tagging, named entity recognition, parsing, and co-reference. A text classifier is also available.15
Open NLP includes tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and co-reference resolution, implemented in Java. It also includes maximum entropy and perceptron-based machine learning algorithms.16
FreeLing includes tools for English and several other languages: text tokenization, sentence splitting, morphological analysis, phonetic encoding, named entity recognition, POS tagging, chart-based shallow parsing, rule-based dependency parsing, nominal co-reference resolution, etc.17
NLTK is a suite of text processing libraries in Python for classification, tokenization, stemming, POS tagging, parsing, and semantic reasoning.18
GATE includes components for diverse language processing tasks, e.g., parsers, morphology, POS tagging. It also contains information retrieval tools, information extraction components for various languages, and many others. The information extraction system (ANNIE) includes a named entity detector.19
NLPTools is a library for NLP written in PHP, geared toward text classification, clustering, tokenizing, stemming, etc.20
Some components of these toolkits were re-trained for social media texts, such as the Stanford part-of-speech (POS) tagger by Derczynski et al. [2013b], and the OpenNLP chunker by Ritter et al. [2011], as we noted earlier.
One toolkit that