Re-training NLP tools for social media texts is relatively easy if annotated training data are available. In general, adapting a tool to a specific domain or a specific type of text requires producing annotated training data for that kind of text. It is easy to collect text of the required kind, but annotating it can be a difficult and time-consuming process.
Some annotated social media data have become available, but the volume is still not high enough. Several NLP tools have been re-trained on the newly annotated data, sometimes also keeping the original annotated newspaper training data in order to obtain a large enough training set. Another approach is to use unlabeled social media text in an unsupervised manner, in addition to the small amounts of annotated social media text.
Another question is what kinds of social media texts to use for training. Twitter messages seem to be more difficult to process than blog posts or messages from forums. Because Twitter messages are limited to 140 characters, they contain more abbreviations and shortened forms of words, and their syntax is more simplified. Therefore, training data should include several kinds of social media texts (unless somebody is building a tool designed for a particular kind of social media text).
We define the tasks accomplished by each kind of tool and we discuss techniques for adapting them to social media texts.
Table 2.2: Examples of tokenization
2.3 TOKENIZERS
The first step in processing a text is to separate the words from punctuation and other symbols. A tool that does this is called a tokenizer. White space is a good indicator of word separation (except in some languages, e.g., Chinese), but even white space is not sufficient. The question of what is a word is not trivial. When doing corpus analysis, there are strings of characters that are clearly words, but there are strings for which this is not clear. Most of the time, punctuation needs to be separated from words, but some abbreviations might contain punctuation characters as part of the word. Take, for example, the sentence: “We bought apples, oranges, etc.” The commas clearly need to be separated from the word “apples” and from the word “oranges,” but the dot is part of the abbreviation “etc.” In this case, the dot also indicates the end of the sentence (two dots were reduced to one). Among the many other issues that arise are how to treat numbers (if they contain commas or dots, these characters should not be separated) and what to do with contractions such as “don’t” (perhaps expand them into the two words “do” and “not”).
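To make these decisions concrete, the following is a minimal sketch of a regex-based tokenizer; the abbreviation list, the contraction dictionary, and the patterns are illustrative assumptions rather than standard resources.

```python
import re

# Assumed, illustrative resources; a real tokenizer would use much larger lists.
ABBREVIATIONS = {"etc.", "e.g.", "i.e.", "vs."}
CONTRACTIONS = {"don't": "do not", "can't": "can not", "won't": "will not"}

TOKEN_PATTERN = re.compile(r"""
    \d+(?:[.,]\d+)*      # numbers: keep internal commas and dots (e.g., 1,000.50)
  | \w+\.                # word followed by a dot (candidate abbreviation)
  | \w+                  # plain word
  | [^\w\s]              # any other single punctuation character
""", re.VERBOSE)

def tokenize(text):
    # Optionally expand contractions into two words, as discussed above.
    for short, expanded in CONTRACTIONS.items():
        text = text.replace(short, expanded)
    tokens = []
    for tok in TOKEN_PATTERN.findall(text):
        # Keep the trailing dot only for known abbreviations such as "etc."
        if tok.endswith(".") and tok not in ABBREVIATIONS:
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens

print(tokenize("We bought apples, oranges, etc."))
# ['We', 'bought', 'apples', ',', 'oranges', ',', 'etc.']
```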
While tokenization usually consists of two subtasks (sentence boundary detection and token boundary detection), the EmpiriST shared task5 provided sentence boundaries, and the participating teams only had to detect token boundaries. Missing whitespace characters present a major challenge for tokenization. Table 2.2 shows a few examples with their correct tokenization.
Methods for Tokenizers
Horsmann and Zesch [2016] evaluated a method for detecting token boundaries that consists of three steps. First, the text is split on whitespace characters. Then, regular expressions are applied to separate alphanumeric text segments from punctuation characters in special character sequences such as smileys. Finally, these punctuation sequences are reassembled: the most common character combinations, learned from the training data, are merged back into single tokens, and word lists are used to merge abbreviations with their following dot character. Accuracy increased in their experiments when more in-domain training data were used.
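The pipeline below sketches this three-step idea in simplified form; the regular expressions, the list of frequent symbol sequences, and the abbreviation list are assumptions made for illustration, not the actual resources used by Horsmann and Zesch [2016].

```python
import re

# Assumed resources for illustration.
KNOWN_ABBREVIATIONS = {"etc", "usw", "bzw", "ca"}        # merged with a following "."
FREQUENT_SYMBOL_SEQUENCES = {":-)", ":)", ":-(", "..."}  # learned from training data

def split_whitespace(text):
    # Step 1: split on whitespace characters.
    return text.split()

def split_punctuation(chunk):
    # Step 2: separate alphanumeric segments from individual punctuation characters.
    return re.findall(r"\w+|[^\w\s]", chunk)

def merge_sequences(tokens):
    # Step 3: re-merge frequent punctuation combinations (e.g., smileys)
    # and abbreviations with their following dot.
    merged, i = [], 0
    while i < len(tokens):
        for span in range(min(4, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in FREQUENT_SYMBOL_SEQUENCES:
                merged.append(candidate)
                i += span
                break
        else:
            if merged and merged[-1] in KNOWN_ABBREVIATIONS and tokens[i] == ".":
                merged[-1] += "."
            else:
                merged.append(tokens[i])
            i += 1
    return merged

def tokenize(text):
    tokens = []
    for chunk in split_whitespace(text):
        tokens.extend(split_punctuation(chunk))
    return merge_sequences(tokens)

# Note the missing whitespace before the smiley in this example.
print(tokenize("Tolle Idee:-) usw."))
# ['Tolle', 'Idee', ':-)', 'usw.']
```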
Evaluation Measures for Tokenizers
Accuracy is a simple measure that calculates how many correct decisions a tool makes. When not all the expected tokens are retrieved, precision and recall are the measures to report. The precision of token recognition measures how many of the tokens found are correct, while recall measures coverage (of the tokens that should have been retrieved, how many were actually found). The F-measure (or F-score) is often reported when a single number is needed, because it is the harmonic mean of precision and recall, and it is high only when both precision and recall are high.6 Evaluation measures are rarely reported for tokenizers, one exception being the CleanEval shared task, which focused on tokenizing text from web pages [Baroni et al., 2008].
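As an illustration, tokenizer output is often scored by comparing the character offsets of system tokens with those of gold-standard tokens; the sketch below computes precision, recall, and F-measure under that (assumed) convention.

```python
def token_spans(tokens, text):
    """Map each token to its (start, end) character offsets in the original text."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return set(spans)

def evaluate(system_tokens, gold_tokens, text):
    system = token_spans(system_tokens, text)
    gold = token_spans(gold_tokens, text)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

text = "We bought apples, oranges, etc."
gold = ["We", "bought", "apples", ",", "oranges", ",", "etc."]
system = ["We", "bought", "apples", ",", "oranges", ",", "etc", "."]
print(evaluate(system, gold, text))
# approximately (0.75, 0.86, 0.80)
```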
Many NLP projects do not mention what kind of tokenization they used and focus instead on higher-level processing. Tokenization, however, can have a large effect on the results obtained at the next levels. For example, Fokkens et al. [2013] replicated two high-level tasks from previous work and obtained very different results when using the same settings but different tokenization.
Adapting Tokenizers to Social Media Texts
Tokenizers need to deal with the specifics of social media texts. Emoticons need to be detected as tokens. For Twitter messages, user names (starting with @), hashtags (starting with #), and URLs (links to web pages) should be treated as tokens, without separating punctuation or other symbols that are part of the token. Some shallow normalization can be useful at this stage. Derczynski et al. [2013b] tested a tokenizer on Twitter data, and its F-measure was around 80%. By using regular expressions designed specifically for Twitter messages, they were able to increase the F-measure to 96%. More about such regular expressions can be found in [O’Connor et al., 2010].
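The gist of such Twitter-specific regular expressions can be sketched as follows; the patterns are deliberately simplified stand-ins for the much more thorough ones described in [O’Connor et al., 2010].

```python
import re

# Simplified, illustrative patterns; real Twitter tokenizers use far more elaborate ones.
TWITTER_TOKEN = re.compile(r"""
    https?://\S+                   # URLs
  | @\w+                           # user names
  | \#\w+                          # hashtags
  | [:;=8][\-o]?[)(\]\[DPp]        # common emoticons, e.g. :-) ;D =(
  | \w+(?:'\w+)?                   # words, with an optional internal apostrophe
  | [^\w\s]                        # any other symbol
""", re.VERBOSE)

def tweet_tokenize(text):
    return TWITTER_TOKEN.findall(text)

print(tweet_tokenize("@alice check this out http://t.co/xyz #nlproc :-)"))
# ['@alice', 'check', 'this', 'out', 'http://t.co/xyz', '#nlproc', ':-)']
```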
2.4 PART-OF-SPEECH TAGGERS
Part-of-speech (POS) taggers determine the part of speech of each word in a sentence. They label nouns, verbs, adjectives, adverbs, interjections, conjunctions, etc. Often they use finer-grained tagsets that distinguish, for example, singular nouns, plural nouns, and proper nouns. Different tagsets exist, one of the most popular being the Penn TreeBank tagset7 [Marcus et al., 1993]; see Table 2.3 for its list of tags. The models embedded in POS taggers are often complex, based on Hidden Markov Models [Baum and Petrie, 1966], Conditional Random Fields [Lafferty et al., 2001], etc. They need annotated training data in order to learn the probabilities and other parameters of the models.
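To illustrate the input and output of a POS tagger, the snippet below uses NLTK’s pre-trained English tagger, which outputs Penn TreeBank tags; it is an averaged perceptron model, chosen here only for convenience rather than one of the models cited above.

```python
import nltk

# Download the required resources once (tokenizer model and pre-trained tagger).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "We bought apples and oranges."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Typically: [('We', 'PRP'), ('bought', 'VBD'), ('apples', 'NNS'),
#             ('and', 'CC'), ('oranges', 'NNS'), ('.', '.')]
```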
Table 2.3: Penn TreeBank tagset
Number | Tag | Description |
1 | CC | Coordinating conjunction |
2 | CD | Cardinal number |
3 | DT | Determiner |
4 | EX | Existential there |
5 | FW | Foreign word |
6 | IN | Preposition or subordinating conjunction |
7 | JJ | Adjective |
8 | JJR | Adjective, comparative |
9 | JJS | Adjective, superlative |
10 | LS | List item marker |
11 | MD | Modal |
12 | NN | Noun, singular or mass |