These descriptive statistics are used to summarize what was observed in the research. But the goal of much research is to generalize the findings beyond just the observations or participants in the study. We ultimately want to say something about behavior in general, not just the behavior that occurred in the study. To make these generalizations, we need inferential statistics. Before leaping into a list of the various inferential statistics you will likely come across in the literature, we would like to review some of the basic concepts of inference.
Inferential Statistics
Inferential statistics are used to generalize the findings of a study to a whole population. An inference is a general statement based on limited data. Statistics are used to attach a probability estimate to that statement. For example, a typical weather forecast does not tell you that it will rain tomorrow afternoon. Instead, the report will indicate the probability of rain tomorrow. Indeed, the forecast here for tomorrow is a 60% chance of rain. The problem with making an inference is that we might be wrong. No one can predict the future, but based on good meteorological information, an expert is able to estimate the probability of rain tomorrow. Similarly, in research, we cannot make generalized statements about everyone when we only include a sample of the population in our study. Instead, we attach a probability estimate to our statements.
When you read the results of research articles, the two most common uses of inferential statistics will be hypothesis testing and confidence interval estimation. Consider the following questions:
Does wearing earplugs improve test performance?
Is exercise an effective treatment for depression?
Is there a relationship between hours of sleep and ability to concentrate?
Are married couples happier than single individuals?
These are all examples of research hypotheses that could be tested using inferential tests of significance. What about the following?
Does the general public have confidence in its nation’s leader?
How many hours of sleep do most adults get?
At what age do most people begin dating?
These are all examples of research with a focus on describing attitudes and/or behavior of a population. This type of research, which is more common in sociology than in psychology, uses confidence interval estimation instead of tests of significance.
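To make confidence interval estimation concrete, here is a minimal sketch in Python. The sleep data are invented for illustration, the function name mean_confidence_interval is hypothetical, and the interval uses a normal (z) critical value, which is an assumption that suits large samples; with a sample this small, a researcher would more likely use a t critical value.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for a population mean.

    Assumption: uses the normal (z) critical value rather than a t value,
    so the interval is slightly too narrow for small samples.
    """
    n = len(sample)
    sample_mean = statistics.mean(sample)
    # Standard error of the mean: sample SD divided by the square root of n.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


# Hypothetical nightly sleep hours for 12 adults (invented data).
sleep_hours = [6.5, 7.0, 7.5, 8.0, 6.0, 7.0, 6.5, 7.5, 8.5, 7.0, 6.0, 7.5]

low, high = mean_confidence_interval(sleep_hours)
print(f"Sample mean: {statistics.mean(sleep_hours):.2f} hours")
print(f"Approximate 95% CI: {low:.2f} to {high:.2f} hours")
```

The interval, rather than a significance test, is the result of interest here: it gives a range of plausible values for the population mean along with a stated level of confidence.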
The vast majority of psychological research involves testing a research hypothesis. So let’s first look at the types of tests of significance you will likely see in the literature and then look at confidence intervals.
Common Tests of Significance
Results will be referred to as either statistically significant or not statistically significant. What does this mean? In hypothesis-testing research, a straw person argument is set up: we assume that a null hypothesis is true, and then we use the data to try to reject the null and thus support our research hypothesis. Statistical significance means that data like those collected would be unlikely if the null hypothesis were true. Nowhere in the research article will you see a statement of the null hypothesis; instead, you will see statements about how the research hypothesis was supported or not supported. These statements will look like this:
With an alpha of .01, those wearing earplugs performed statistically significantly better (M = 35, SD = 1.32) than those who were not (M = 27, SD = 1.55), t(84) = 16.83, p = .002.
The small difference in happiness between married (M = 231, SD = 9.34) and single individuals (M = 240, SD = 8.14) was not statistically significant, t(234) = 1.23, p = .21.
These statements appear in the results section and describe the means and standard deviations of the groups and then a statistical test of significance (a t test in both examples). In both statements, statistical significance is indicated by the italic p. This value is the p value. It is the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. Because the null hypothesis is the opposite of the research hypothesis, we want this value to be low. The accepted convention is a p value lower than .05 or, better still, lower than .01. The results will support the research hypothesis when the p value is lower than the alpha level the researcher has chosen (.05 or .01) and will not support it when the p value is greater than that level. You may see a nonsignificant result reported as ns with no p value included.
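The logic behind a p value can be sketched in a few lines of Python. This is a minimal illustration, not the t test reported in the examples above: it assumes a permutation test, which shuffles the group labels many times to mimic a true null hypothesis and counts how often the shuffled difference in means is at least as large as the observed one. The scores are invented, and the function names are hypothetical.

```python
import random
import statistics


def mean_difference(group_a, group_b):
    """Difference between the two group means (a minus b)."""
    return statistics.mean(group_a) - statistics.mean(group_b)


def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Two-sided permutation p value for a difference in means.

    Under the null hypothesis the group labels are interchangeable, so we
    reshuffle the pooled scores and see how often chance alone produces a
    difference at least as extreme as the one actually observed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean_difference(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled = abs(mean_difference(pooled[:n_a], pooled[n_a:]))
        if shuffled >= observed:
            at_least_as_extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


# Hypothetical test scores, invented for illustration only.
earplugs = [35, 36, 34, 37, 33, 35, 36, 34]
no_earplugs = [27, 28, 26, 29, 25, 27, 28, 26]

alpha = 0.05  # significance level chosen before looking at the data
p = permutation_p_value(earplugs, no_earplugs)
print(f"p = {p:.4f}; statistically significant at alpha = {alpha}: {p < alpha}")
```

A small p value here means that shuffled labels rarely produce a difference as large as the observed one. It does not give the probability that the null hypothesis is true.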
You will find a refresher on statistical inference, including a discussion of Type I and Type II errors, and statistical power in Chapter 4.
Researchers using inferential techniques draw inferences based on the outcome of a statistical significance test. There are numerous tests of significance, each appropriate to a particular research question and the measures used, as you will recall from your introductory statistics course. It is beyond the scope of our book to describe in detail all or even most of these tests. You might want to refresh your memory by perusing your statistics text, which of course you have kept, haven’t you? We offer a brief review of some of the most common tests of significance used by researchers in the “Basic Statistical Procedures” section of this chapter.
Going back to the results section of our example article, we see that the author has divided that section into a number of subsections. The first subsection, with the heading “Mood,” reports the effect of light on mood. It is only one sentence: “No significant results were obtained” (Knez, 2001, p. 204). The results section is typically brief, but the author could have provided the group means and the statistical tests that were not statistically significant. The next subsection, titled “Perceived Room Light Evaluation,” provides a statistically significant effect. Knez (2001) reports a significant (meaning statistically significant) gender difference. He reports Wilks’ lambda, which is a statistic used in multivariate ANOVA (MANOVA; when there is more than one DV), and the associated F statistic and p value for the gender difference, F(7, 96) = 3.21, p = .04. He also includes a figure showing the mean evaluations by men and women of the four light conditions and separate F statistics and p values for each condition.
In the subsections that follow, Knez (2001) reports the results and statistical tests for the effect of light condition on the various DVs. He reports one of the effects as a “weak tendency to a significant main effect” (p. 204) with a p value of .12. We would simply say that it was not statistically significant, ns. Indeed, many of Knez’s statistical tests produced p values greater than .05. We bring this to your attention as a reminder that even peer-reviewed journal articles need to be read with a critical eye. Don’t just accept everything you read. You need to pay attention to the p values and question when they are not less than .05. You also need to examine the numbers carefully to discern the effect size.
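When an article reports means and standard deviations but no effect size, a reader can compute a rough one. The sketch below assumes Cohen's d, a standardized mean difference that neither this text nor the example article names, and it assumes the two groups are the same size so that the pooled standard deviation is a simple average of the variances. The function name cohens_d is hypothetical; the numbers are the earplug means and standard deviations from the sample results statement earlier in this section.

```python
import math


def cohens_d(mean_1, sd_1, mean_2, sd_2):
    """Cohen's d from summary statistics.

    Assumption: both groups have the same number of participants, so the
    pooled SD is the square root of the average of the two variances.
    """
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd


# Earplug example: M = 35, SD = 1.32 versus M = 27, SD = 1.55.
d = cohens_d(35, 1.32, 27, 1.55)
print(f"Cohen's d = {d:.2f}")
```

Because d expresses the difference between means in standard deviation units, it lets a reader judge the size of an effect separately from its p value.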
What is noticeably missing from the results section of Knez (2001), our example article, is a calculation of effect size. Effect size gives us some indication of the strength of the effect (see Chapter 4 for more detail). Remember, statistical significance tells us that an effect was likely not due to chance and is probably a reliable effect. What statistical significance does not indicate is how large the effect is. If we inspect the numbers in Knez’s article, we see that the effects were not very large. For example, on the short-term recall task, the best performance was from the participants in the warm-lighting conditions. They had a mean score of 6.9 compared with the other groups, with a mean score of about 6.25. A difference of only 0.65 of a word on a recall task seems like a pretty small effect, but then