When they are properly phrased, statistical results are hard to disprove because they do not contain absolutist language. This built-in ambiguity can be frustrating to those who want a strict yes-or-no answer, especially after they have waited until the data are collected and analyzed, sometimes at great cost. Unfortunately for these individuals, statistics are not meant to suit the convenience of the moment. Only so much can be properly inferred from a system of educated guesses, regardless of how carefully the guesses are made.
It would not be much of an exaggeration to say that the world is run by statistics, or at least by people who base their decisions on statistical information. With the communications industry almost omnipresent in day-to-day life, people’s interest in statistics, and their need to understand them, have quietly mushroomed into a dominant feature of modern life. Why? Because statistics are used to answer people’s questions, and the answers reported (through TV, radio, newspapers, etc.) use statistics as evidence. Remember, research needs samples, and samples generate statistics.
In our day-to-day lives, we use statistics without even knowing it. Those of us who own and drive a car guess whether we can make it to the next gas station based on what we know of the road conditions, typical gas mileage (a statistical issue, to be sure), and the consequence of being wrong should we run out of gas while on the way to where we are planning to go. Nondrivers make equally statistical estimations and decisions, such as how much cash to keep on hand for a given weekend’s planned activities.
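The gas station guess can be written out as a small calculation. What follows is only a minimal sketch in Python, not a recommended method: the function names, the mileage figures, and the rule of subtracting one standard deviation from the average mileage are all assumptions made for illustration.

```python
from statistics import mean, stdev

def estimated_range_miles(fuel_gallons, mpg_history, caution=1.0):
    """Return (typical, cautious) estimates of how far the fuel will go.

    typical  = fuel * average mileage
    cautious = fuel * (average mileage - caution * sample standard deviation)
    The 'caution' multiplier is an illustrative assumption, not a standard rule.
    """
    avg = mean(mpg_history)
    spread = stdev(mpg_history)  # sample standard deviation of past mileage
    typical = fuel_gallons * avg
    cautious = fuel_gallons * max(avg - caution * spread, 0.0)
    return typical, cautious

def can_reach(distance_miles, fuel_gallons, mpg_history, caution=1.0):
    """Decide using the cautious estimate, since running dry is costly."""
    _, cautious = estimated_range_miles(fuel_gallons, mpg_history, caution)
    return distance_miles <= cautious

# Hypothetical mileage from four past fill-ups, in miles per gallon.
mpg_history = [28.0, 30.0, 32.0, 30.0]
typical, cautious = estimated_range_miles(2.0, mpg_history)
print(f"typical: {typical:.1f} miles, cautious: {cautious:.1f} miles")
print("station 50 miles away:", can_reach(50, 2.0, mpg_history))
print("station 58 miles away:", can_reach(58, 2.0, mpg_history))
```

With two gallons left, the typical estimate says 60 miles, but the cautious one falls a few miles short of that. A station 58 miles away is reachable on average yet fails the cautious test, which is the kind of judgment the driver makes without ever naming it statistics.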
In our first encounter with the high school principal, we see that he has data that are capable of addressing a wide variety of relevant academic and social questions. His questions are mostly about current academic achievement, but some are more future-oriented. He wants to use his data to support budgeting and expansion activities as well as to identify current problems for more immediate attention. He wants to become known for using science and statistics in his professional life due to his ambition to become a district superintendent someday.
The principal considers all of his data to be samples, even when a casual observer might suggest otherwise. He wants to generalize to other classes and years. His 20 years at the school have shown him that changes in the makeup of each student class occur very slowly and subtly from year to year. For him, statistics are safer than having to take a harder stance. He likes to fall back on statistics being a science of quantifying ambiguity, and that means that his answers will be somewhat ambiguous, too.
The director of public health will have access to very large and mostly representative samples to address her questions. Although her questions will be answered through her statewide electronic database, records for some people are missing, for a variety of reasons, throughout the data. Yet her data are more representative than most and are likely to be consistent from year to year, even given any unknown sources of bias. Given the importance of year-to-year comparisons and the need to be sensitive to people moving in and out of the medical assistance program over time, she is quite pleased with the representativeness, completeness (the relative lack of missing data), and comprehensiveness (the availability of measures that address the characteristics important to her questions) of her data. For her, ambiguity is part of what gives her the luxury of testing different hypotheses for the potential impact of changes in public health policy. It has taken her a long time to gain professional credibility in a previously male-dominated field, and her reliance on carefully considered statistical methodologies is a central part of her strategy for maintaining and extending her leadership in the field. The changes in medical insurance and health care delivery since the introduction of the Affordable Care Act (ACA) have created a situation where she needs to adjust her policies and programs to accommodate the added influx of Medicaid beneficiaries within her databases. This changing landscape of people and policies adds a layer of ambiguity to her job that could not have been foreseen when she accepted her position with the State.
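Completeness, described above as the relative lack of missing data, can be checked one measure at a time. The sketch below assumes a tiny, made-up set of records in which a missing value is stored as `None`. The field names and values are hypothetical and are not drawn from any real database.

```python
def completeness(records, field):
    """Share of records with a non-missing (non-None) value for `field`."""
    if not records:
        raise ValueError("no records to assess")
    present = sum(1 for record in records if record.get(field) is not None)
    return present / len(records)

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "county": "A"},
    {"age": None, "county": "B"},
    {"age": 51, "county": "C"},
    {"age": 47, "county": "A"},
]

for field in ("age", "county"):
    print(f"{field}: {completeness(records, field):.0%} complete")
```

A figure like this says how much is missing, not why it is missing. The reasons behind the gaps are what bear on representativeness.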
Yet ambiguity holds one of the keys to the path of statistical knowledge. The ambiguity in the system means that no known solution fits perfectly. Ah, the challenge! Think of it! What is the best solution? Is the best solution the one with the smallest errors, or is it the most parsimonious one? What data are available? How good are they? Judgments fly. Decisions are made. The challenge in defining a statistical solution often is not to be the most correct but, instead, the least wrong! How? By means of a sharp question that cuts straight through ambiguity associated both with the statistical and methodological approaches and with the data themselves.
The overarching message about statistics is that they are uncertain. Treated that way, statistics become more of an intellectual challenge, with less attachment and much less certainty. So, continue to relax and marvel at some close-up views of the foundation of statistics. Look at where the cracks are. Realize what those cracks could mean for edifices built on top. Smile, as together we experience more about this view of the world.
3. Fodder—Data
Observe
Record
More
Data are what we hear, see, smell, taste, touch, and more. Data can even be what we sense. Data can represent anything and everything that we can discriminate well enough to distinguish from something else. In short, if it can be perceived, it can be coded and used as data.
Data are the fodder of measurement, the backbone of statistics. Through a context, data become transformed into information. That context is a fusion of substantive knowledge of a topic with a methodological approach to gathering the data and the statistics used to derive meaning. A large part of the misuse of statistics is a nonreflective, uncritical crunching of numbers (i.e., data) to generate other, somewhat context-free, numbers. These uncritically examined results are then granted trusted status based on unfounded validity (discussed later in some detail). The result could be a poor decision or an ineffective policy, yet the statistics eventually are blamed. To become useful information, good data need to be placed in relevant contexts, with clear understandings of the strengths and weaknesses of the statistics and results.
This relevant context is the frame of reference from which relative meaning is derived. To know whether something is big or small, we have to ask: compared with what? A blue whale is small compared with the planet. An ant is huge compared with atoms. This same need for a frame of reference, a comparison point, is important to most types of knowledge that might be acquired through statistics. Several types of frames of reference exist in statistics, as we will see.
One brief side note on the word data: Data is a plural word. Until very recently, its only proper grammatical use was as a plural noun, like geese. Correctly, then, data are transformed through a context into information. A single piece of data is called a datum. With all that said, a recent English dictionary has recognized the common use of data as a singular noun and lists that use as a secondary preference.
Modern databases can contain dozens of gigabytes of information—an amount that is truly staggering to consider. High-speed office computers can need hours just to run through the data once. Census data are now available across the Internet. From course catalogs to recent golf scores to real-time stock prices, data surround us as oceans surround fish. Data are everywhere and generally too common even to notice.
Here is where the tao of statistics starts to take shape. Curiosity gives birth to questions that create the need for data that come from measures that people design to create meaning. We open our eyes with questions and perceive contextually rich data as probabilistic answers. Depending on how we ask our questions, how we look for and process the data, and how we place results in a meaningful context, the