Where information is missing in these commercial data products, it is often imputed or simulated from other data subjects with similar characteristics. Attitude profiling is also used where demographic information is missing. This involves profiling individuals by combining responses to multiple attitudinal questions. These imputation processes can raise questions about data accuracy. However, there is only limited research in this area. These commercial data sources are available with short lead times, and techniques have been developed to combine different types of data, often involving large numbers of individuals and variables. Individual record data matched to postcodes and/or individual names can be purchased at relatively low cost.
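To make the idea of imputing from data subjects with similar characteristics concrete, the following is a minimal sketch of one common approach, often called hot-deck imputation. It is not a description of any particular supplier's method. The field names (`age_band`, `region`, `income_band`), the `imputed` flag and the example records are all hypothetical and chosen only for illustration.

```python
import random


def hot_deck_impute(records, target, match_keys, seed=0):
    """Fill missing `target` values by copying the value of a randomly
    chosen donor record that shares the same values on `match_keys`."""
    rng = random.Random(seed)

    # Group the observed target values by the matching characteristics.
    donors = {}
    for rec in records:
        if rec.get(target) is not None:
            key = tuple(rec[k] for k in match_keys)
            donors.setdefault(key, []).append(rec[target])

    completed = []
    for rec in records:
        new_rec = dict(rec)
        new_rec["imputed"] = False  # flag so imputed values stay identifiable
        if rec.get(target) is None:
            key = tuple(rec[k] for k in match_keys)
            pool = donors.get(key)
            if pool:  # leave the gap if no similar donor exists
                new_rec[target] = rng.choice(pool)
                new_rec["imputed"] = True
        completed.append(new_rec)
    return completed


# Hypothetical records: two have no income band.
people = [
    {"id": 1, "age_band": "25-34", "region": "North", "income_band": "mid"},
    {"id": 2, "age_band": "25-34", "region": "North", "income_band": None},
    {"id": 3, "age_band": "65+", "region": "South", "income_band": "low"},
    {"id": 4, "age_band": "65+", "region": "North", "income_band": None},
]
result = hot_deck_impute(people, "income_band", ["age_band", "region"])
```

In this sketch, record 2 receives the value of its only matching donor, while record 4 stays missing because no donor shares its characteristics. Keeping an explicit `imputed` flag is one way of letting later analysis separate observed from imputed values, which bears on the accuracy questions noted above.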
Commercial data product suppliers are also now providing access to social media data including Twitter posts and analysis of such sources as YouTube, long-form blogs and Facebook, as well as data from online game playing. It is to these types of data we now turn, including what we consider consequential, self-published and trace data.
2.2.6 New Data Types and Approaches
Social media is increasingly seen as an invaluable source of data for social research. Research use of social media data might involve the textual analysis of micro-blogs such as Twitter to code for attitudes (see Chapter 8), explore networks (see Chapter 10) and contextual patterns, and to track movements using user name, time of posting, geography and network links (see Chapter 9). An example of this, which we consider in more detail in Chapter 3, was the collection and coding of 2.6 million Twitter postings during the civil disturbances in England in 2011. The data was used to examine patterns of communication during the riots and to conduct a content analysis of rumour patterns (see Lewis et al., 2011; Procter et al., 2013a; 2013b). A second example of social media data use involves a study of videos posted on YouTube in response to the release of a film criticizing Islam. The analysis involved examining the nature and content of comments and the links to other uploaded videos and postings to map the scale of the protests and the nature of the dialogue (Van Zoonen et al., 2011). Other studies have examined the nature of political protest by examining online postings and discussions; see, for example, Bowman-Grieve and Conway’s (2012) research into dissident Irish Republicanism.
Social media data can also be used in a more exploratory way for social science research. For example, purposive sampling techniques can support the development of research ideas and the testing of concepts. A recent example of such an approach, entitled The Everyday Sexism Project, involved the development of a website where the public were invited to report experiences of sexism (Bates, 2014).35 The data has strengths and weaknesses. It provides an insight into, and examples of, reported sexual harassment. However, the sample is limited. There is no verification and there can be no straightforward extrapolation to provide a measure of prevalence, though the evidence suggests prevalence is significant. A more robust research design might ask respondents to report key demographics and change over time, and to describe how their experiences compare with those of people they know in their social networks.
Another functional difference of many of the new data types is that the data is often generated directly by individuals and organizations themselves for their own purposes. A recent example of what we term self-generated data involves a UK police force using Twitter to announce all the emergency calls it received in order to highlight its work over a given period.36 This constitutes a potentially rich source of data for social science research, produced not via a traditional social science research design but self-reported by an organization. We consider this type of data in more detail below.
The United Nations is embracing the potential of digital evidence in relation to human rights and also examining policy impacts in almost real time.37 This can take the form of monitoring food price discussions and money transfer patterns via mobile phone, or tracking health concerns expressed through Twitter or captured in Internet search records. Such techniques can include feedback loops where people’s attitudes and behaviour can be followed up. The data can be used to conduct almost real time research. However, data quality assessment practices still need to be applied and the data still needs to be verified. Quality assurance mechanisms are being developed which involve volunteers validating data.
There is a link here to what is termed citizen science and crowdsourcing, whereby people voluntarily allow the collation of their own data, or contribute data that they gather themselves, and also undertake data processing and coding. Data is being generated by citizens not only about themselves but also about issues they might have an interest in. For example, in the Satellite Sentinel Project, citizens are being asked to volunteer to observe and code images for evidence of human rights abuses (e.g. military activity or signs of explosions) in Sudan sourced from a network of private satellites.
In the changing data environment, there are increasingly detailed records of actual behaviour accumulating alongside survey data on people’s reported behaviour, and there is scope for reporting and monitoring behaviour in almost real time to complement more traditional social science research data gathered retrospectively through surveys, interviews and diaries. For example, purchase data collected by supermarkets could be used alongside food diaries; mobile phone movement data could be used alongside self-recorded time use data; and health monitoring data could be used alongside surveys of people’s self-reported health.
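As a rough illustration of how recorded and self-reported behaviour might be set side by side, the sketch below aligns purchase records with food diary entries for the same people and items. It is only a sketch under stated assumptions: the person identifiers, item names and quantities are hypothetical, and it assumes the two sources can already be linked on a shared identifier, which in practice raises its own matching and consent issues. Purchases are also not the same thing as consumption, so a gap between the two sources is a prompt for further enquiry, not a measure of misreporting.

```python
from collections import defaultdict


def compare_sources(purchases, diary):
    """Total the quantities per (person, item) in each source and
    return one row per pair showing both totals and their difference."""
    recorded = defaultdict(int)
    for person_id, item, qty in purchases:
        recorded[(person_id, item)] += qty

    reported = defaultdict(int)
    for person_id, item, qty in diary:
        reported[(person_id, item)] += qty

    rows = []
    # Include pairs that appear in only one of the two sources.
    for key in sorted(set(recorded) | set(reported)):
        rec_qty = recorded.get(key, 0)
        rep_qty = reported.get(key, 0)
        rows.append({
            "person_id": key[0],
            "item": key[1],
            "recorded": rec_qty,   # from supermarket purchase data
            "reported": rep_qty,   # from the self-completed food diary
            "gap": rec_qty - rep_qty,
        })
    return rows


# Hypothetical linked data: (person_id, item, quantity)
purchases = [("p1", "crisps", 3), ("p1", "apples", 2),
             ("p2", "apples", 4), ("p1", "crisps", 1)]
diary = [("p1", "crisps", 1), ("p1", "apples", 2), ("p2", "apples", 5)]

rows = compare_sources(purchases, diary)
for row in rows:
    print(row)
```

The same pattern could in principle be applied to the other pairings mentioned above, such as mobile phone movement data alongside self-recorded time use data, provided a common unit of comparison can be defined.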
The step change for social science research lies in the potential, where appropriate, for identifying and bringing together the different data types described in our eight-point typology: orthodox intentional data, participative intentional data, consequential data, self-published data, social media data, trace data, found data and synthetic data. Synthetic data can be used as part of simulation and agent-based studies. For an example, see the recent UK project on the Social Complexity of Diversity, which uses computer-based simulations, and Chapter 6 in this volume.38
Moreover, almost real time data opens up opportunities for what may be termed ‘real time’ or ‘live’ social science, though this clearly challenges standard practices and timescales in research for data quality assurance and for peer review.
2.3 Combining Data and Mixed Methods – Key Research Areas
A useful way to examine and understand the new types of data in context is by comparison with existing forms of social science research data. Through a series of broad social science research policy areas, we will now consider some orthodox intentional data sources and research designs (such as social surveys) and identify other new data sources that might be used in combination with them as part of mixed methods studies. We compare different types of variables in each of the key areas. In doing so, we will use the UK as an example country, whilst acknowledging that data environments vary across countries and that some data sources transcend national boundaries.
2.3.1 Data on Economic Circumstances
Key sources for capturing data on people’s economic circumstances in the UK include the Census39 and the Labour Force Survey (LFS).40 The Census provides a profile of the UK population every 10 years. It collects information on people’s employment, health and family circumstances. It is a key tool in estimating population change. Data from the Census is available in summary tables as well as in samples of microdata. Specific data tables can also be requested for an administration fee. The questions on economic circumstances are, however, limited. It is notable that it is anticipated