NLTK also provides several tokenizers similar to ANNIE’s, including one based on regular expressions, written in Python.
2.5 SENTENCE SPLITTING
Sentence detection (or sentence splitting) is the task of separating text into its constituent sentences. This typically involves determining whether punctuation, such as full stops, exclamation marks, and question marks, denotes the end of a sentence or something else (quoted speech, abbreviations, etc.). Most sentence splitters use lists of abbreviations to help determine this: a full stop typically denotes the end of a sentence unless it follows an abbreviation such as Mr., or lies within quotation marks. Other issues involve determining sentence boundaries when line breaks are used, such as in addresses or in bulleted lists. Sentence splitters vary in how such cases are handled.
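To make the basic heuristic concrete, the following minimal Python sketch splits on sentence-final punctuation unless the preceding token is a known abbreviation. The abbreviation list and regular expression are purely illustrative; real splitters, including those described below, use much larger lists or learned models.

```python
import re

# Illustrative (and deliberately tiny) abbreviation list; real splitters use much larger ones.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "etc.", "e.g.", "i.e."}

def naive_split(text):
    """Split on ., ! or ?, unless the full stop ends a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        tokens = text[start:end].split()
        prev_token = tokens[-1].lower() if tokens else ""
        if match.group() == "." and prev_token in ABBREVIATIONS:
            continue  # the full stop belongs to an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():            # trailing text with no final punctuation
        sentences.append(text[start:].strip())
    return sentences

print(naive_split("Mr. Smith arrived late. Did he stay? Yes."))
# ['Mr. Smith arrived late.', 'Did he stay?', 'Yes.']
```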
More complex cases arise when the text contains tables, titles, formulae, or other formatting markup: these are usually the biggest source of error. Some splitters ignore these completely, requiring a punctuation mark as a sentence boundary. Others use two consecutive newlines or carriage returns as an indication of a sentence end, while in some cases even a single newline or carriage return character indicates the end of a sentence (e.g., comments in software code or bulleted/numbered lists which have one entry per line). GATE’s ANNIE sentence splitter provides several variants in order to let the user decide which is the most appropriate solution for their particular text. HTML formatting tags, Twitter hashtags, wiki syntax, and other such special text types are also somewhat problematic for general-purpose sentence splitters, which have typically been trained on well-written corpora such as newspaper texts. Note that tokenization and sentence splitting are sometimes performed as a single task rather than sequentially.
Sentence splitters generally make use of already tokenized text. GATE’s ANNIE sentence splitter uses a rule-based approach implemented in GATE’s JAPE pattern-action rule-writing language [7]. The rules are based entirely on information produced by the tokenizer and some lists of common abbreviations, and can easily be modified as necessary. Several variants are provided, as mentioned above.
Unlike ANNIE, the OpenNLP sentence splitter is typically run before the tokenization module. It uses a machine learning approach, with the supplied models trained on untokenized text, although it is also possible to perform tokenization first and let the sentence splitter process the already tokenized text. One flaw in the OpenNLP splitter is that, because it cannot identify sentence boundaries based on the content of the sentence, it may make errors on articles which have titles, since these are mistakenly identified as part of the first sentence.
NLTK uses the Punkt sentence segmenter [8]. This uses a language-independent, unsupervised approach to sentence boundary detection, based on identifying abbreviations, initials, and ordinal numbers. Unlike most sentence splitters, its abbreviation detection does not rely on precompiled lists, but is instead based on collocation detection methods such as log-likelihood.
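As an illustration, the pretrained English Punkt model shipped with NLTK can be used via sent_tokenize; the exact splits depend on the model version, and the example text is invented. Because Punkt is unsupervised, a custom model can also be trained directly on raw, domain-specific text.

```python
import nltk

# nltk.download('punkt')  # fetch the pretrained English Punkt model once, if not already present

text = "Prof. Smith joined Acme Corp. in 2010. Impressive, isn't it?"
for sentence in nltk.sent_tokenize(text):
    print(sentence)

# Training a new Punkt model on raw domain text (raw_domain_text is a placeholder corpus):
# from nltk.tokenize.punkt import PunktSentenceTokenizer
# custom = PunktSentenceTokenizer(raw_domain_text)
# custom.tokenize(text)
```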
Stanford CoreNLP makes use of tokenized text and a set of binary decision trees to decide where sentence boundaries should go. As with the ANNIE sentence splitter, the main problem it tries to resolve is deciding whether a full stop denotes the end of a sentence or not.
In some studies, the Stanford splitter scored the highest accuracy out of common sentence splitters, although performance will of course vary depending on the nature of the text. State-of-the-art sentence splitters such as the ones described score about 95–98% accuracy on well-formed text. As with most linguistic processing tools, each one has strengths and weaknesses which are often linked to specific features of the text; for example, some splitters may perform better on abbreviations but worse on quoted speech than others.
2.6 POS TAGGING
Part-of-Speech (POS) tagging is the task of labeling each word with its part of speech, e.g., noun, verb, adjective. These basic linguistic categories are typically divided into quite fine-grained tags, distinguishing for instance between singular and plural nouns, and different tenses of verbs. For languages other than English, gender may also be included in the tag. The set of possible tags used is critical and varies between different tools, making interoperability between different systems tricky. One very commonly used tagset for English is the Penn Treebank (PTB) tagset [9]; other popular sets include those derived from the Brown corpus [10] and the LOB (Lancaster-Oslo/Bergen) Corpus [11]. Figure 2.4 shows an example of some POS-tagged text, using the PTB tagset.
Figure 2.4: Representation of a POS-tagged sentence.
The POS tag is determined by taking into account not just the word itself, but also the context in which it appears. This is because many words are ambiguous, and reference to a lexicon is insufficient to resolve this. For example, the word love could be a noun or verb depending on the context (I love fish vs. Love is all you need).
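The ambiguity can be seen by running NLTK’s default POS tagger (a pretrained perceptron model using PTB tags) on these two examples; the tags shown in the comments are indicative only and may differ slightly depending on the model version.

```python
import nltk

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # pretrained models, fetched once

print(nltk.pos_tag(nltk.word_tokenize("I love fish")))
# e.g., [('I', 'PRP'), ('love', 'VBP'), ('fish', 'NN')]   -- 'love' tagged as a verb

print(nltk.pos_tag(nltk.word_tokenize("Love is all you need")))
# e.g., [('Love', 'NN'), ('is', 'VBZ'), ('all', 'DT'), ('you', 'PRP'), ('need', 'VBP')]  -- 'love' as a noun
```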
Approaches to POS tagging typically use machine learning, because it is quite difficult to describe all the rules needed for determining the correct tag given a context (although rule-based methods have been used). Some of the most common and successful approaches use hidden Markov models (HMMs) or maximum entropy models. The Brill transformational rule-based tagger [12], which uses the PTB tagset, is one of the best-known taggers and is used in several major NLP toolkits. It uses a default lexicon and ruleset acquired from a large corpus of training data via machine learning. Similarly, the OpenNLP POS tagger also uses a model learned from a training corpus to predict the correct POS tag from the PTB tagset. It can be trained with either a maximum entropy or a perceptron-based model. The Stanford POS tagger is also based on a maximum entropy approach [13] and makes use of the PTB tagset. The TNT (Trigrams’n’Tags) tagger [14] is a fast and efficient statistical tagger using an implementation of the Viterbi algorithm for second-order Markov models.
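As a small illustration of the corpus-trained, HMM-based approach (a sketch, not the implementation of any of the taggers above), NLTK’s supervised HMM trainer can be run on the Penn Treebank sample that ships with NLTK; the training split and test sentence are arbitrary.

```python
import nltk
from nltk.corpus import treebank
from nltk.tag.hmm import HiddenMarkovModelTrainer

# nltk.download('treebank')  # a small sample of the Penn Treebank distributed with NLTK

tagged_sents = treebank.tagged_sents()   # sentences as lists of (word, PTB tag) pairs
train_sents = tagged_sents[:3000]        # hold back the remainder for testing if desired

# Train a supervised first-order HMM tagger from the tagged corpus.
hmm_tagger = HiddenMarkovModelTrainer().train_supervised(train_sents)

print(hmm_tagger.tag("The company reported higher profits .".split()))
```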
In terms of major NLP toolkits, some (such as Stanford CoreNLP) have their own POS taggers, as described above, while others use existing implementations or variants on them. For example, NLTK has Python implementations of the Brill tagger, the Stanford tagger, and the TNT tagger. GATE’s ANNIE English POS tagger [15] is a modified version of the Brill tagger trained on a large corpus taken from the Wall Street Journal. It produces a POS tag as an annotation on each word or symbol. One of the big advantages of this tagger is that the lexicon can easily be modified manually by adding new words or changing the value or order of the possible tags associated with a word. It can also be retrained on a new corpus, although this requires a large pre-tagged corpus of text in the relevant domain/genre, which is not easy to find.
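For example, NLTK’s pure-Python TnT implementation can be trained on the same Penn Treebank sample; again this is a minimal sketch with an arbitrary training split and test sentence.

```python
import nltk
from nltk.corpus import treebank
from nltk.tag import tnt

# nltk.download('treebank')  # fetch the corpus sample once, if not already present

train_sents = treebank.tagged_sents()[:3000]

tnt_tagger = tnt.TnT()        # NLTK's implementation of the TnT trigram tagger
tnt_tagger.train(train_sents)

print(tnt_tagger.tag("The company reported higher profits .".split()))
```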
The accuracy of these general-purpose, reusable taggers is typically excellent (97–98%) on texts similar to those on which the taggers have been trained (mostly news articles). However, the accuracy can fall sharply when presented with new domains, genres, or noisier data such as social media. This can have a serious knock-on effect on other processes further down the pipeline such as Named Entity recognition, ontology learning via lexico-syntactic patterns, relation and event extraction, and even opinion mining, which all need reliable POS tags in order to produce high-quality results.
2.7 MORPHOLOGICAL ANALYSIS AND STEMMING