a cultural, technological, and scholarly phenomenon that rests on the interplay of: (1) Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets. (2) Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims. (3) Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.
(boyd and Crawford, 2012, p. 663)
From a critical sociological perspective, Lupton (2014, p. 101) argues that the hype that surrounds the new technological possibilities afforded by big data analytics contributes to the belief that such data are ‘raw materials’ for information – that they contain the untarnished truth about society and sociality. In reality, each step of the process in the generation of big data relies on a number of human decisions relating to selection, judgement, interpretation, and action. Therefore, the data that we will have at hand are always configured via beliefs, values, and choices that ‘“cook” the data from the very beginning so that they are never in a “raw” state’. So, there is no such thing as raw data, even though the orderliness of neatly harvested and stored big datasets can create an illusion to the contrary.
Sociologist David Beer (2016, p. 149) argues that we now live in ‘a culture that is shaped and populated with numbers’, where trust and interest in anything that cannot be quantified diminishes. Furthermore, in the age of big data, there is an obsession with causation. As boyd and Crawford (2012, p. 665) argue, the mirage and mythology of big data demand that a number of critical questions are raised with regard to ‘what all this data means, who gets access to what data, how data analysis is employed, and to what ends’. There is a risk that the lure of big data will sideline other forms of analysis, and that alternative methods with which to analyse the beliefs, choices, expressions, and strategies of people are pushed aside by the sheer volume of numbers. ‘Bigger data are not always better data’, they write, and the analysis of them will not necessarily lead to insights about society that are more true than what can be achieved through other data and methods.
Many popular examples illustrate how rapidly datafication is intensifying, the most famous one being Moore’s Law, according to which the number of components on an integrated circuit, and with it the power of computers and their memory and storage, doubles at regular intervals (Moore, 1965). Another telling comparison is this one: The Great Library of Alexandria, which was established in the third century BCE, was regarded as the centre of knowledge in the ancient world. It was believed to hold within it the sum total of all human knowledge. Today, however, there is an estimated 1,200 exabytes (1,200 million terabytes) of data in the world, enough to give each person alive more than 300 times as much information as historians think was stored in the library’s entire collection (Cukier and Mayer-Schoenberger, 2013).
We are no doubt in the midst of an ongoing data explosion, and along with it the development of ‘data science’. Data science is an interdisciplinary specialisation at the intersection of statistics and computer science, focusing on machine learning and other forms of algorithmic processing of large datasets to ‘liberate and create meaning from raw data’ rather than on hypothesis testing (Efron and Hastie, 2016, p. 451). Data science is a successor to the form of ‘data analysis’ proposed by the statistician John W. Tukey, whose analytical framework focused on ‘looking at data to see what it seems to say’, making partial descriptions and trying ‘to look beneath them for new insights’. In his exploratory vein, Tukey (1977, p. v) also emphasised that this type of analysis was concerned ‘with appearance, not with confirmation’. This focus on mathematical structure and algorithmic thinking, rather than on inferential statistical justification, is a precursor to the flourishing of data science in the wake of datafication.
All the things that people do online in the context of social media generate vast volumes of sociologically interesting data. Such data have been approached in highly data-driven ways within the field of data science, where the aim is often to get a general picture of some particular social pattern or process. Being data-driven is not a bad thing, but there must always be a balance between data and theory – between information and its interpretation. This is where sociology and social theory come into the picture, as they offer a wide range of conceptual frameworks and theories that can aid in the analysis and understanding of the large amounts and many forms of social data that proliferate in today’s world.
But in those cases where we see big data being analysed, there is far too often a disconnect between the data and the theory. One explanation for this may be that the popularity and impact of data science make its data-driven ethos spill over into the academic fields that try to learn from it. This means that we risk forgetting about theoretical analysis, which may fade in the light of sparkling infographics.
It is my argument that social research that relies heavily on the computational amassing and processing of data must also be theoretically sensitive. While purely computational methods are extremely helpful when wrangling the units of information, the meanings behind the messy social data generated in this age of datafication can be better untangled if we also make use of the rich interpretive toolkit provided by sociological theories and theorising. The data do not speak for themselves, even though some big data evangelists have claimed that to be the case (Anderson, 2008).
Big data and data science are partly technological phenomena, which are about using computing power and algorithms to collect and analyse comparatively large datasets of often unstructured information. But they are also, and most prominently, cultural and political phenomena that come along with the idea that huge unstructured datasets, often based on social media interactions and other digital traces left by people, can, when paired with methods like machine learning and natural language processing, offer a higher form of truth which can be computationally distilled rather than interpretively achieved.
Such mythological beliefs are not new, however, as there has long been, if not a hierarchy, at least a strict division of research methods within the cultural and social sciences, where some methods – those that have come to be labelled ‘quantitative’, and that analyse data tables with statistical tools – have been vested with an ‘aura of truth, objectivity, and accuracy’ (boyd and Crawford, 2012, p. 663). Other methods – those commonly named ‘qualitative’, and involving close readings of textual data from interviews, observations, and documents – are seen as more interpretive and subjective, rendering richer but also (allegedly) more problematic results. This book rests on the belief that this distinction is not only annoying, but also wrong. We can get at approximations of ‘the truth’ by analysing social and cultural patterns, and those analyses are by definition interpretive, no matter the chosen methodological strategy. Especially in this day and age where data, the bigger the better, are fetishised, it is high time to move on from the unproductive dichotomy of ‘qualitative’ versus ‘quantitative’.
Data theory
Pure data science tends to focus very strongly on simply what is researchable. It goes for the issues for which there are data, regardless of whether those issues have any real-life urgency. The last decade