To increase the amount of geolocated data, researchers have developed a range of methods for automatically inferring location from available user data [Han et al., 2014]. We summarize these techniques in the next chapter, in Section 4.3.1.
Location stability Some location data is dynamic, meaning that it is updated to the user's current location for each message that is sent. GPS-tagged tweets and IP address geolocation are dynamic: they describe the location of the user at the time the activity was performed. Other information is static and may stay the same as a person moves around. For example, the location field of a user profile typically describes a user's primary home location and does not change as the person travels. In general, the location of a user can be difficult to quantify, as locations change and their accuracy can be subjective. For example, identifying a user as residing in New York City when they actually live across the river in New Jersey may be sufficient for many applications, even though the identification names the wrong state. Similarly, for a user who resides in El Paso, Texas, United States and works in Juarez, Chihuahua, Mexico, either city would be an accurate location despite the two being in different countries. In contrast, confusing the state or country would be a major geolocation error in most cases. See Dredze et al. [2013] for some of the challenges in evaluating geolocation.
3.4.3 SOCIAL NETWORK STRUCTURE
Another useful type of data is the network structure of a social platform, meaning the links or relationships between platform users. Social network structure is important for certain types of public health surveillance, such as predicting the spread of disease [Sadilek et al., 2012a,b] or understanding social support for healthy behaviors such as smoking cessation [Cobb et al., 2011].
Many platforms explicitly encode relationships between users. For example, in Facebook, users become “friends” upon mutual agreement. In Twitter, users “follow” other users, meaning that they subscribe to read the content posted by the users they follow. Following a user on Twitter is an asymmetric act and does not require mutual consent.
It is also possible to implicitly construct a social network. For example, one might infer a relationship between users if they communicate on a social network [Rao et al., 2010]. Even if explicit network information is available, implicit communication networks may also serve as a useful alternative, as these networks imply a different type of relationship. For example, Twitter users who communicate with each other may have a stronger relationship than users who follow each other but do not communicate. An “affiliation network” connects two users who share a common activity, like reading the same article or purchasing the same product [Mishra et al., 2013].
Network relationships can be either directed or undirected. Undirected relationships are symmetric, such as “friend” relationships in Facebook. Directed relationships flow from one user to another, such as a “follow” relationship in Twitter, in which one user follows another. Directed relationships can always be treated as undirected, if needed for a task, by removing the directionality.
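To make these representations concrete, the following is a minimal sketch using the networkx Python library (not named in the text, and chosen here only for illustration). It builds a directed communication network from hypothetical mention pairs and then drops the directionality, as described above.

```python
import networkx as nx

# Hypothetical (author, mentioned_user) pairs extracted from messages;
# each pair means the author mentioned the other user in a message.
mentions = [
    ("alice", "bob"),
    ("bob", "alice"),
    ("carol", "alice"),
]

# Directed network: each edge points from the message author to the mentioned user.
directed = nx.DiGraph()
directed.add_edges_from(mentions)

# Treat the relationships as undirected by removing the directionality.
undirected = directed.to_undirected()

print(directed.number_of_edges())    # 3 directed edges
print(undirected.number_of_edges())  # 2 undirected edges: alice-bob collapses to one
```

In this sketch, the mutual mentions between two users collapse into a single undirected edge, which is one simple way to restrict attention to reciprocated (and possibly stronger) relationships.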
3.5 DATA COLLECTION
We provide a brief summary of some of the most popular data sources in the social media research community and their associated APIs (application program interfaces) to serve as a starting guide. We encourage readers to visit the developer pages of the platform of interest for more information. Working directly with an API may be beyond the ability of researchers without technical training, although there are some guides written specifically for non-technical researchers (see Denecke et al. [2013], Yoon et al. [2013], Schwartz and Ungar [2015]). Some of the platforms described below make data available in easy-to-use formats, such as comma-separated values (CSV), usually including a rich variety of metadata, and others sell data in formats suitable for non-technical researchers.
Twitter makes it very easy to obtain a wide variety of data using their API.2 The streaming API provides a constant real-time data feed (approximately 1% of all tweets), while the REST API allows for searching through (limited) historical data. This allows researchers to collect targeted datasets based on specific keywords, locations, or users. There are a variety of tutorials and tools available for quickly starting a Twitter data collection.3 Commercial options for larger data collection are available through Gnip, Twitter’s enterprise API platform, which can provide samples larger than 1% and historical data matching specific queries. Gnip also provides data from other platforms, including Instagram and YouTube.4
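As one concrete illustration, the sketch below uses the third-party tweepy package (version 3.x; not named in the text) to collect tweets from the streaming API that match a set of keywords. The credentials and keyword list are placeholders you would supply yourself, and the details may change as the API evolves.

```python
import tweepy

# Placeholder credentials obtained from https://dev.twitter.com/
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."


class SaveListener(tweepy.StreamListener):
    """Append each matching tweet to a file as raw JSON."""

    def on_data(self, raw_json):
        with open("tweets.json", "a") as out:
            out.write(raw_json.strip() + "\n")
        return True  # keep the stream open

    def on_error(self, status_code):
        # Disconnect if Twitter signals rate limiting (HTTP 420).
        return status_code != 420


auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Filter the public stream by keywords of interest (placeholder keywords).
stream = tweepy.Stream(auth=auth, listener=SaveListener())
stream.filter(track=["flu", "influenza", "fever"])
```

A similar collection can be keyed to locations or user IDs by passing the `locations` or `follow` arguments to `filter` instead of `track`.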
Facebook also has a robust API that allows for a number of different data queries,5 including the Graph API, which is the primary way to read from the Facebook social graph. However, unlike Twitter, most Facebook data is not publicly available and can only be accessed with explicit permission from the data’s author. Additionally, Facebook provides various search methods but not a streaming method, making it difficult to obtain random samples of data. An alternate approach is to develop a Facebook app that obtains explicit sharing permissions from users. While time consuming to develop and promote, investments in Facebook apps can yield valuable datasets [De Choudhury et al., 2014a, Schwartz et al., 2013].
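For the public data that is accessible, a minimal sketch of reading posts from a public Facebook Page through the Graph API might look like the following, using the requests library. The access token and Page name are placeholders, and the API version in the URL is illustrative; you should use whichever version is current.

```python
import requests

# Placeholder access token and public Page identifier.
ACCESS_TOKEN = "..."
PAGE_ID = "cdc"  # hypothetical public health Page

# The Graph API version segment ("v2.8") is illustrative only.
url = "https://graph.facebook.com/v2.8/{}/posts".format(PAGE_ID)
params = {
    "access_token": ACCESS_TOKEN,
    "fields": "message,created_time",
    "limit": 25,
}

response = requests.get(url, params=params)
response.raise_for_status()

for post in response.json().get("data", []):
    print(post.get("created_time"), post.get("message", ""))
```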
Reddit is a popular online forum and content-sharing service, where users can submit content and leave comments. It is one of the most popular forum sites, and therefore hosts content on a wide range of topics including health. Reddit provides an API that makes it easy to download content.6
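A brief sketch of this API using the third-party PRAW package (the Python Reddit API Wrapper, not named in the text) follows; the credentials and subreddit name are placeholders.

```python
import praw

# Placeholder credentials registered at https://www.reddit.com/dev/api
reddit = praw.Reddit(
    client_id="...",
    client_secret="...",
    user_agent="health-monitoring-example",
)

# Pull recent submissions from a health-related subreddit (placeholder name).
for submission in reddit.subreddit("flu").new(limit=100):
    print(submission.created_utc, submission.title)
    # Comments can be traversed as well; resolve collapsed comment threads first.
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        print("  ", comment.body[:80])
```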
Google Trends provides aggregated keyword search data going back to 2004, with the ability to show trends specific to a location, time or category.7 The site also suggests related queries, so that users can expand their search to find other queries relevant to their topic of interest. Google allows data to be exported in CSV format. Bing provides a similar tool, though it is aimed at advertisers.8
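Beyond exporting CSV files from the website, the data can also be pulled programmatically through the unofficial pytrends package (a third-party wrapper, not named in the text, whose interface may change). A minimal sketch, with placeholder keyword, region, and time range, might look like this.

```python
from pytrends.request import TrendReq

# Connect to Google Trends via the unofficial wrapper.
pytrends = TrendReq(hl="en-US", tz=360)

# Placeholder query: interest in "flu" in the United States since 2004.
pytrends.build_payload(["flu"], timeframe="2004-01-01 2016-12-31", geo="US")

# Returns a pandas DataFrame indexed by date, one column per keyword.
interest = pytrends.interest_over_time()
print(interest.head())

# Related queries, similar to the suggestions shown on the website.
related = pytrends.related_queries()["flu"]
if related["top"] is not None:
    print(related["top"].head())
```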
Additionally, some health-specific data resources are described in Section 5.1.4 for the purpose of disease surveillance.
2 https://dev.twitter.com/
3 For example, see http://socialmedia-class.org/twittertutorial.html and https://github.com/mdredze/twitter_stream_downloader.
4 https://gnip.com/sources/
5 https://developers.facebook.com/
6 https://www.reddit.com/dev/api
7 https://www.google.com/trends/
8 http://www.bing.com/toolbox/keywords
CHAPTER 4
Methods of Monitoring
This chapter surveys methodology: the types of information that can be analyzed and how to do so, covering machine learning, statistical modeling, and qualitative methods. We start by discussing quantitative methods (statistical analysis of data), including large-scale computational approaches to categorizing and extracting trends from social data, both at the level of populations and individuals. We also discuss validation: how to know when to trust your analysis. We then briefly discuss qualitative methods as a potentially richer but smaller-scale methodology. Lastly, we discuss issues involved in designing a study, including methods for inferring population demographics, an important component of public health research.
This chapter touches on some advanced concepts in machine learning that are not taught in depth in this book, though we do provide pointers to other tutorials and tools. Our aim is to provide a high-level overview of these methods, introducing important terminology, surveying different ways of approaching a problem, and giving examples of typical pipelines for conducting social monitoring.