An Introduction to Text Mining. Gabe Ignatow.

Author: Gabe Ignatow
Publisher: Ingram
Genre: Sociology
ISBN: 9781506337029
Advantages and Limitations of Online Digital Resources for Social Science Research

      The use of online digital resources, and of social media in particular, has both advantages and drawbacks. Salganik (in press) provided a good summary of the characteristics of big data in general, many of which also apply to social media. He grouped these characteristics into those that help research and those that hinder it.

      Among the characteristics that make big data good for research are (a) its size, which can allow for the observation of rare events, for causal inferences, and generally for more advanced statistical processing than is possible with small data sets; (b) its “always-on” property, which adds a time dimension and makes the data suitable for studying unexpected events and producing real-time measurements (e.g., capturing people’s reactions during a tornado by analyzing tweets from the affected area); and (c) its nonreactive nature, which implies that people behave more naturally because they are not aware that their data are being captured (in contrast to surveys, where respondents know they are being studied).
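
      The time dimension that comes with “always-on” data can be illustrated with a short sketch. The Python code below assumes a hypothetical list of (timestamp, text) pairs, invented here purely for illustration, and counts how many messages per hour mention a keyword. The function name `hourly_mentions` and the sample messages are assumptions and do not correspond to any platform's API.

```python
from collections import Counter
from datetime import datetime

def hourly_mentions(messages, keyword):
    """Count messages per hour whose text contains `keyword` (case-insensitive)."""
    counts = Counter()
    for timestamp, text in messages:
        if keyword.lower() in text.lower():
            # Truncate the timestamp to the start of its hour.
            hour = timestamp.replace(minute=0, second=0, microsecond=0)
            counts[hour] += 1
    return dict(sorted(counts.items()))

# Invented sample data; real timestamps would come from a platform export.
sample = [
    (datetime(2015, 5, 6, 17, 5), "Tornado sirens going off downtown"),
    (datetime(2015, 5, 6, 17, 40), "Taking shelter, tornado nearby"),
    (datetime(2015, 5, 6, 18, 10), "The tornado has passed, everyone ok?"),
    (datetime(2015, 5, 6, 18, 30), "Traffic is terrible tonight"),
]

for hour, n in hourly_mentions(sample, "tornado").items():
    print(hour.isoformat(), n)
```

      A simple substring match like this one is only a starting point; a real study would need to handle misspellings, retweets, and messages that mention the keyword in an unrelated sense.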

      There are also characteristics that make big data less appealing for research. (a) It is often incomplete: digital data collections frequently lack demographics or other information that is important for social studies. (b) It is inherently biased, in that the contributors to such online resources are not a random sample of the population. Consider, for instance, people who tweet many times a day versus those who choose never to tweet; they represent different populations with different interests, personalities, and values, and even the largest collection of tweets will not capture the behaviors of those who do not use Twitter. (c) It changes over time, in terms of both users (who generates social media data and how they generate it) and platforms (how the social media data are captured), which makes it difficult to conduct longitudinal studies. (d) Finally, it is susceptible to algorithmic confounds, which are properties that seem to belong to the data being studied but are in fact caused by the underlying system used to collect the data. An example is the seemingly magic number of 20 friends that many people seem to have on Facebook, which turns out to be an effect of the Facebook platform actively encouraging people to make friends until they reach 20 friends (Salganik, in press). In addition, some types of digital data are inaccessible (for example, e-mails, queries sent to search engines, and phone calls), which makes it difficult to conduct research on behaviors associated with those data types.
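
      The bias described under (b) can be made concrete with a toy calculation. The sketch below uses invented daily tweet counts for a hypothetical population of five people (the names and numbers are assumptions chosen for illustration, not data from any study) and reports what share of the people tweet at all and what share of all tweets comes from the single most active user.

```python
def tweet_shares(tweets_per_user):
    """Return (share of people who tweet at all,
    share of all tweets written by the most active user)."""
    total_people = len(tweets_per_user)
    total_tweets = sum(tweets_per_user.values())
    active = [n for n in tweets_per_user.values() if n > 0]
    share_active = len(active) / total_people if total_people else 0.0
    share_top = max(active, default=0) / total_tweets if total_tweets else 0.0
    return share_active, share_top

# Invented daily tweet counts for a hypothetical population of five people.
population = {"ana": 40, "ben": 8, "cho": 2, "dev": 0, "eli": 0}

share_active, share_top = tweet_shares(population)
print(f"{share_active:.0%} of people tweet; "
      f"the heaviest user writes {share_top:.0%} of all tweets")
```

      In this made-up population, a sample drawn at the level of tweets would be dominated by one person and would say nothing about the two people who never tweet, which is the sampling problem the paragraph above describes.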

      Examples of Social Science Research Using Digital Data

      Most chapters of this textbook include examples of social science research studies that use social media data. If you are interested in using Facebook data for your own project, it is important to review the studies discussed in Chapter 3 on the Facebook ethics controversy. In addition, research by the sociologist Hanna (2013) on using Facebook to study social movements may be a useful starting point. Hanna reviewed procedures for analyzing social movements such as the Arab Spring and Occupy movements by applying text mining methods to Facebook data. Hanna used the Natural Language Toolkit (NLTK; www.nltk.org) and the R package ReadMe (http://gking.harvard.edu/readme) to analyze mobilization patterns of Egypt’s April 6 youth movement. He corroborated results from his text mining methods with in-depth interviews with movement participants.

      If you are interested in using Twitter data, two Twitter-based thematic analysis studies (see Chapter 11) are good places to start. The first is a study of the Centers for Disease Control and Prevention’s live Twitter chat on Ebola, conducted by Lazard, Scheinfeld, Bernhardt, Wilcox, and Suran (2015). Lazard’s team collected, sorted, and analyzed users’ tweets to reveal major themes of public concern: the symptoms and life span of the virus, disease transfer and contraction, safe travel, and protection of one’s body. Lazard and her team used SAS Text Miner (www.sas.com/en_us/software/analytics/text-miner.html) to organize and analyze the Twitter data.

      A second thematic analysis study that uses Twitter data is by the mental health researchers Shepherd, Sanders, Doyle, and Shaw (2015). The researchers assessed how Twitter is used by individuals with experience of mental health problems by following the hashtag #dearmentalhealthprofessionals and conducting a thematic analysis to identify common themes of discussion. They found 515 unique communications that were related to the specified conversation. The majority of the material related to four overarching themes: (1) the impact of diagnosis on personal identity and as a facilitator for accessing care, (2) balance of power between professional and service user, (3) therapeutic relationship and developing professional communication, and (4) support provision through medication, crisis planning, service provision, and the wider society.
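
      The collection step of a hashtag-based study like this one can be sketched in a few lines. The Python code below is not the procedure used by Shepherd and colleagues, whose thematic coding was done by the researchers themselves; it only illustrates, under stated assumptions, how messages carrying a hashtag might be filtered and de-duplicated before such coding begins. The function name `unique_hashtag_messages` and the sample messages are invented for illustration.

```python
import re

def unique_hashtag_messages(messages, hashtag):
    """Return distinct messages that contain `hashtag`.

    Matching is case-insensitive, and messages that differ only in
    letter case or whitespace are treated as duplicates.
    """
    # \b prevents matching a longer hashtag that merely starts the same way.
    pattern = re.compile(re.escape(hashtag) + r"\b", re.IGNORECASE)
    seen = set()
    unique = []
    for text in messages:
        if not pattern.search(text):
            continue
        # Normalize whitespace and case so trivial duplicates collapse.
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

# Invented sample messages; real ones would come from a platform export.
sample_messages = [
    "#DearMentalHealthProfessionals please listen before you diagnose",
    "#dearmentalhealthprofessionals  please listen before you diagnose",
    "Medication helped me, thank you #dearmentalhealthprofessionals",
    "Unrelated post about the weather",
]

kept = unique_hashtag_messages(sample_messages, "#dearmentalhealthprofessionals")
print(len(kept), "unique messages")
```

      What counts as a “unique communication” is itself a research decision; the normalization rule here (case and whitespace only) is one possible choice among many.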

      Conclusion

      This chapter has addressed the role played by data in social science research and provided an overview of the advantages and limitations of digital data as a way to collect information from people in support of such human-centered research projects. The chapter has surveyed a number of online data sources, with forward pointers to Chapters 5 and 6, which specifically address data collection and data sampling. Examples of social science research projects that make use of information obtained from digital resources were also provided, mainly to illustrate the kinds of research questions that can be answered with such data; more examples are provided in the following chapters (specifically in Chapters 10 through 12).

      Key Term

       Unstructured data

      Highlights

       Social science research has traditionally relied on surveys, but new computational approaches have enabled the use of unstructured data sources as a way to learn about people.

       Surveys are structured data sets that include clear, targeted information collected in controlled settings. They have the disadvantage of being expensive to run, which limits the frequency and number of surveys that can be collected for a study.

       Unstructured data sets are very large, “always on” naturally occurring digital resources, which can be used to extract or infer information on people. They have their own disadvantages, which include the fact that the information that can be obtained from these resources is often inexact and incomplete as well as subject to the biases associated with the groups of people who generate these data sources.

       Digital resources can be accessed as collections available through institutional memberships (e.g., LexisNexis), via APIs provided by various platforms (e.g., the Twitter API), or through scraping and crawling as described in Chapter 6.

      Discussion Questions

       Describe a social science research project that you know of, which has been based on survey data, and discuss how that same research project could be conducted using digital data resources. What kind of resources would you use? What kind of challenges do you expect to run into?

       While digital resources have their own advantages, as discussed in this chapter, there are certain types of information that cannot be collected from such unstructured data. Give examples of such types of information that can be collected only through surveys.

       Now consider a research project in which you would need to combine the benefits of unstructured data (e.g., Twitter) and structured data (e.g., surveys). In other words, your project requires that every subject in your data set provides both