• Chapter 6: The last chapter summarizes the methods and applications described in the preceding chapters. We conclude with a discussion of the high potential for research, given the social media analysis needs of end-users.
As mentioned in the Preface, the intended audience of this book is researchers who are interested in developing tools and applications for the automatic analysis of social media texts. We assume that readers have basic knowledge of natural language processing and machine learning. Nonetheless, we will try to define as many notions as we can, in order to make the material accessible to beginners in these two areas. We also assume basic knowledge of computer science in general.
1.5 SUMMARY
In this chapter, we reviewed the structure of social network and social media data as the collection of textual information on the Web. We presented semantic analysis in social media as a new opportunity for big data analytics and for intelligent applications. Monitoring and analyzing the continuous flow of user-generated content in social media adds a dimension of valuable information that is not available from traditional media and newspapers. In addition, we mentioned the challenges with social media data, which are due to their large size, and to their noisy, dynamic, and unstructured nature.
1 http://people.eng.unimelb.edu.au/tbaldwin/pubs/starsem2014.pdf
3 http://www.cision.com/uk/files/2013/10/social-journalism-study-2013.pdf
4 All publications can be found in the ACL Anthology: https://www.aclweb.org/anthology/
CHAPTER 2
Linguistic Pre-processing of Social Media Texts
2.1 INTRODUCTION
In this chapter, we discuss current Natural Language Processing (NLP) linguistic pre-processing methods and tools that were adapted for social media texts. We survey the methods used for adaptation to this kind of text. We briefly define the evaluation measures used for each type of tool in order to be able to mention the state-of-the-art results.
In general, evaluation in NLP can be done in several ways:
• manually, by having humans judge the output of each tool;
• automatically, on test data that humans have annotated with the expected solution ahead of time; and
• task-based, by using the tools in a task and evaluating how much they contribute to the success in the task.
We primarily focus on the second approach here. It is the most convenient, since it allows the tools to be re-evaluated automatically each time their methods are changed or improved, and it allows different tools to be compared on the same test data. Care should be taken when human judges annotate data. There should be at least two annotators, and they should be given proper instructions on what and how to annotate (in an annotation manual). The agreement rate between the annotators needs to be reasonably high, to ensure the quality of the obtained data. When there are disagreements, the expected solution is obtained by taking a vote (if there is an odd number of annotators, three or more), or by having the annotators discuss until they reach an agreement (if there are only two annotators, or an even number). When reporting the inter-annotator agreement for a dataset, the kappa statistic also needs to be reported, in order to correct the observed agreement for the agreement that could occur by chance [Artstein and Poesio, 2008, Carletta, 1996].
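To make the chance correction concrete, the following is a minimal sketch of Cohen's kappa for the two-annotator case. The function name `cohen_kappa` and the toy sentiment labels are illustrative assumptions, not taken from any dataset discussed in this book.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: estimated from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single, identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations of ten messages by two judges.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this toy example the judges agree on 8 of the 10 items, but because both use the label "pos" 60% of the time, about half of that agreement would be expected by chance, so kappa is noticeably lower than the raw agreement rate.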
NLP tools often use supervised machine learning, and the training data are usually annotated by human judges. In such cases, it is convenient to keep aside some of the annotated data for testing and to use the remaining data to train the models. Many of the methods discussed in this book use machine learning algorithms for automatic text classification. That is why we give a very brief introduction here. See, e.g., [Witten and Frank, 2005] for details of the classical algorithms and [Sebastiani, 2002] for how they can be applied to text data. Also see [Eisenstein, 2019] for more details about deep learning classification techniques for text data.
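Setting aside part of the annotated data for testing can be sketched as follows. The helper name `hold_out_split`, the 20% test fraction, and the fixed random seed are assumptions made for illustration; libraries such as scikit-learn provide their own utilities for this purpose.

```python
import random

def hold_out_split(examples, test_fraction=0.2, seed=0):
    """Shuffle annotated examples and hold out a fraction of them for testing."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(examples)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical annotated data: (text, label) pairs.
annotated = [("message %d" % i, "pos" if i % 2 else "neg") for i in range(10)]
train_set, test_set = hold_out_split(annotated)
print(len(train_set), len(test_set))
```

The test portion must not be used in any way while training or tuning the model; otherwise the reported results overestimate performance on unseen data.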
A supervised text classification model predicts the label c of an input x, where x is a vector of feature values extracted from document d. The class c can take two or more possible values from a specified set (or even continuous numeric values, in which case the model is called a regression model). The training data contain document vectors for which the classes are provided. The classifier uses the training data to learn which features, or combinations of features, are strongly associated with one of the classes but not with the other classes. In this way, the trained model can make predictions for unseen test data in the future. There are many classification algorithms. We name here only a few of the classifiers popular in NLP tasks.
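As an illustration of how a document d can be turned into a feature vector x, here is a minimal bag-of-words sketch, assuming whitespace tokenization and raw word counts as feature values. Both choices, and the helper names, are simplifying assumptions; real systems typically use more careful tokenization and feature weighting.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every word seen in the training documents to a vector index."""
    words = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: index for index, tok in enumerate(words)}

def to_feature_vector(document, vocabulary):
    """Represent a document as a vector of word counts over the vocabulary."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, count in counts.items():
        if tok in vocabulary:  # words unseen in training are ignored
            vector[vocabulary[tok]] = count
    return vector

training_docs = ["great phone great price", "bad battery"]
vocab = build_vocabulary(training_docs)
x = to_feature_vector("great battery great screen", vocab)
print(sorted(vocab, key=vocab.get), x)
```

Each position of x corresponds to one vocabulary word, so all documents are mapped into the same feature space regardless of their length.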
Decision trees take one feature at a time, compute its power of discriminating between the classes, and build a tree with the most discriminative features in the upper part of the tree; decision trees are useful because the models can be easily understood by humans. Naïve Bayes is a classifier that learns the probabilities of association between features and classes; these models are used because they are known to work well with text data (see a more detailed description in Section 2.8.1). Support Vector Machines (SVMs) compute a hyperplane that separates two classes, and they can efficiently perform nonlinear classification using what is called a kernel to map the data into a high-dimensional feature space where they become linearly separable [Cortes and Vapnik, 1995]; SVMs are probably the most often used classifiers due to their high performance on many tasks that have small amounts of training data.
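To show what learning "probabilities of association between features and classes" can look like in practice, below is a minimal sketch of a Naïve Bayes text classifier. It assumes the multinomial variant with add-one smoothing, which is one common choice and not necessarily the exact formulation used later in the book; the function names and the four training messages are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count classes and per-class word occurrences from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(model, tokens):
    """Return the class with the highest log posterior for the given tokens."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocabulary:
                continue  # skip words never seen in training
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("love this phone".split(), "pos"),
    ("great battery love it".split(), "pos"),
    ("hate this phone".split(), "neg"),
    ("terrible battery hate it".split(), "neg"),
]
model = train_naive_bayes(training)
print(predict(model, "love it".split()), predict(model, "hate battery".split()))
```

Working with sums of log probabilities rather than products of probabilities avoids numerical underflow on longer documents.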
Lately, most linguistic tools and applications employ deep neural network classifiers, which were shown to lead to better performance when large amounts of training data are available. These classifiers include Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) [LeCun et al., 2015]. A special type of RNN is the Long Short-Term Memory (LSTM) network, which adds gates, including forget gates, that control what is kept in the network's memory and thus allow modeling long-distance context [Hochreiter and Schmidhuber, 1997].
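The role of the gates can be illustrated with a single step of an LSTM cell. The sketch below assumes the commonly used formulation with forget, input, and output gates, reduced to a hidden size of 1 so that all quantities are scalars; the parameter values are hand-picked for illustration rather than learned, and the names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell (hidden size 1).

    params maps each gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, activation):
        w, u, b = params[name]
        return activation(w * x + u * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old memory to keep
    i = gate("input", sigmoid)         # how much new information to write
    o = gate("output", sigmoid)        # how much of the memory to expose
    g = gate("candidate", math.tanh)   # the new information itself
    c = f * c_prev + i * g             # updated memory cell
    h = o * math.tanh(c)               # updated hidden state
    return h, c

# Illustrative parameters: forget gate almost fully open, input gate almost closed.
remember = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0  # a value of 1.0 has been stored in the memory cell
for _ in range(20):
    h, c = lstm_step(0.5, h, c, remember)
print(round(c, 3))
```

With the forget gate close to 1 and the input gate close to 0, the stored value survives 20 steps almost unchanged, which is the mechanism that lets the network carry information across long distances.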
There are many machine learning libraries that can be used. We mention here only a few: Weka1 and scikit-learn2 for classical algorithms and PyTorch3 and TensorFlow4 for deep learning algorithms. The first one is in Java, while the last three are in Python.
A sequence-tagging model can be seen as a classification model, but it differs fundamentally from a conventional one: instead of dealing with a single input x and a single label c each time, it predicts a sequence of labels c = (c1, c2, …, cn) based on a sequence of inputs x = (x1, x2, …, xn) and the predictions from the previous steps. Sequence tagging has been applied successfully in natural language processing (for sequential data such as sequences of part-of-speech tags, discussed in the previous chapter) and in bioinformatics (for DNA sequences). There exist a number of sequence-tagging models, including Hidden Markov Models (HMMs) [Baum and Petrie, 1966], Conditional Random Fields (CRFs) [Lafferty et al., 2001], and Maximum Entropy Markov Models (MEMMs) [Berger et al., 1996].
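As a small illustration of predicting a whole label sequence rather than independent labels, the sketch below decodes the most probable tag sequence under an HMM with the Viterbi algorithm. The three-tag set and all probabilities are hand-set assumptions for a toy example; in practice they would be estimated from annotated data.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations under an HMM."""
    if not observations:
        return []
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path

# Toy model with hand-set (not learned) probabilities.
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.6, "barks": 0.1, "walk": 0.3},
    "VERB": {"barks": 0.5, "walk": 0.4, "dog": 0.1},
}
tags = viterbi(["the", "dog", "barks"], states, start_p, trans_p, emit_p)
print(tags)
```

The word "barks" could be a noun or a verb in this toy model; the transition probabilities from the preceding tag are what make the verb reading win. Implementations meant for long sequences usually work with log probabilities to avoid numerical underflow.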
Sequence-to-sequence models based on deep learning can also be used to transform sequences into other sequences, for example a sequence of words into a sequence of part-of-speech tags. They are also used in recent machine translation systems, to transform a sequence of words in one language into a sequence of words in another language [Gehring et al., 2017].
Before