Two-tailed significance test: When the critical region for significance is spread across both tails of the distribution.
One-tailed significance test: When the critical region for significance is limited to one tail of the distribution (more lenient than two-tailed tests).
Both Type I and Type II errors should be avoided, but Joseph Simmons et al. (2011) view Type I errors as more problematic. Their argument is that once these false positives (incorrect claims that a finding exists) appear in the literature, they tend to stay there. Such findings can encourage other researchers to investigate the phenomenon further, which may waste resources.
Sources of Type I Error and Remedies
Remember that in inferential statistics, we are estimating the likelihood or probability that the data represent the true situation, and we set that likelihood at a given level, called the alpha level. By convention, the alpha level is set at .05. What that means is that there are only 5 chances in 100 (5/100 = .05) that we are mistaken in saying that our results are significant when the null hypothesis is in fact true (see also Chapter 2 on this topic).
Standard approaches to reduce the likelihood of a Type I error are as follows: adjusting the alpha level; using a two-tailed rather than a one-tailed test; and using a Bonferroni adjustment for multiple analyses (dividing the alpha level by the number of statistical tests you are performing). In theory, you could set your alpha level at a more stringent level (e.g., .01) to avoid a Type I error, but most researchers do not, fearing that doing so will increase the chance of a Type II error.
A second approach is using a two-tailed rather than a one-tailed significance test. Please note that the decision to use a one- versus a two-tailed test is made prior to conducting analyses (and is typically indicated in the hypotheses). The difference between the two deals with how your alpha is distributed. In a two-tailed test, the alpha level (.05) is divided in two (.025), meaning that each tail of the test statistic's distribution contains .025 of your alpha. Thus, the two-tailed test is more stringent than a one-tailed test because the critical region for significance is spread across two tails, not just one. A one-tailed test is not adopted in practice unless your hypothesis is stated as unidirectional rather than bidirectional. Again, that decision has to be made prior to conducting analyses.
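To make the distinction concrete, here is a minimal sketch in Python (using the SciPy library, which this chapter does not itself reference; the scores are made up for illustration). The same pair of samples is tested both ways; when the observed difference falls in the predicted direction, the one-tailed p value is half the two-tailed p value, which is why the one-tailed test is more lenient.

```python
# A minimal sketch (Python with SciPy); the data are hypothetical.
from scipy import stats

group_a = [3.1, 2.9, 3.4, 3.8, 3.2, 3.6, 3.0, 3.5]  # made-up scores
group_b = [2.8, 2.5, 3.0, 2.7, 2.9, 2.6, 3.1, 2.4]

# Two-tailed test: alpha (.05) is split across both tails (.025 each).
t_two, p_two = stats.ttest_ind(group_a, group_b, alternative="two-sided")

# One-tailed test: the hypothesis must be stated as unidirectional
# (here, group_a > group_b) before the analysis is run.
t_one, p_one = stats.ttest_ind(group_a, group_b, alternative="greater")

print(f"two-tailed: t = {t_two:.3f}, p = {p_two:.4f}")
print(f"one-tailed: t = {t_one:.3f}, p = {p_one:.4f}")  # half of p_two here
```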
Another way in which a Type I error can occur is through the use of multiple statistical tests with the same data. This situation may happen in research because there are not enough participants with specific demographic characteristics to run a single analysis. Here's an example. Suppose you want to examine issues involving number of majors (e.g., students who identify as having one major, two majors, or three majors), class year (first vs. fourth year), and athlete status (varsity athlete [A] or not [NA]), and the dependent variables of interest are GPA and career indecision (see Figure 3.6).
What this table shows is that we have 12 cells (look at the bottom row) to fill with participants who have the appropriate characteristics. A minimum number of participants per cell might be 15 individuals. But we don't need just any 15 individuals; each cell must be filled with 15 who have the required characteristics for that cell. For Cell 1, we need 15 students who identify as having one major, are in their first year, and are varsity athletes. For Cell 2, we need 15 students who identify as having one major, are in their first year, and are not varsity athletes. You can see the difficulty of filling all of the cells with students who have the sought-after characteristics; a single analysis might not be possible.

In the full analysis, testing all variables at once, you would have 2 (class year) × 2 (athlete status) × 3 (number of majors) = 12 cells. If we instead tested the variables two at a time, each analysis would involve fewer cells. We might ignore athlete status in one analysis and look at class year and number of majors against the DVs (GPA and career indecision), which is six cells. In a second analysis, we might look at number of majors (three) and athlete status (two) against the DVs (six cells again). In a third analysis, we would look at athlete status (two) and class year (two), which is four cells. We would then have run three analyses instead of one, and the likelihood of a Type I error has increased because of these multiple tests.

For that reason, many researchers recommend using a Bonferroni adjustment, which resets the alpha level (more stringently). To use a Bonferroni adjustment, you divide the conventional alpha level (.05) by the number of tests you have run (here, 3) to produce a new alpha level of .05 / 3 = .017. Now, to consider a finding significant, the result would have to meet the new (more stringent) alpha level, as the sketch after Figure 3.6 illustrates. (For those who want to read in more detail about this issue, the articles by Banerjee et al. [2009] and by Bender and Lange [2001] may be helpful.)
Figure 3.6 Example of Using Multiple Statistical Tests With the Same Data
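To see the arithmetic of the Bonferroni adjustment, here is a minimal sketch in Python; the three p values are hypothetical, chosen only to show the decision rule against the adjusted alpha.

```python
# A minimal sketch of the Bonferroni adjustment; the p values are made up.
alpha = 0.05
num_tests = 3                       # the three analyses described above
adjusted_alpha = alpha / num_tests  # .05 / 3 = .0167 (rounded to .017)

p_values = [0.030, 0.012, 0.041]    # hypothetical results of the three tests
for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"Analysis {i}: p = {p:.3f} -> {verdict} "
          f"at adjusted alpha = {adjusted_alpha:.3f}")
```

Note that a p value of .030 would have counted as significant at the conventional .05 level but does not meet the adjusted criterion.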
Revisit and Respond 3.2
In your own words, explain the difference between Type I and Type II errors.
Type II Errors: Sample Size, Power, and Effect Size
In a Type II error, we fail to reject the null hypothesis when we should have done so. Often the problem is having too few participants; therefore, having an adequate sample size is the primary way to address this problem. In general, larger sample sizes produce more power (see next section). Power is the ability to evaluate your hypothesis adequately. Formally, power is the probability of rejecting H₀ (the null hypothesis), assuming H₀ is false.
Power: Probability of rejecting H₀ (the null hypothesis), assuming H₀ is false.
When a study has sufficient power, you can adequately test whether the null hypothesis should be rejected. Without sufficient power, it may not be worthwhile to conduct a study. If findings are nonsignificant, you won’t be able to tell whether (a) you missed an effect or (b) no effect exists.
There are several reasons why you might not be able to reject the null hypothesis, assuming H₀ is false. Your experimental design may be flawed or suffer from other threats to internal validity. Internal validity refers to whether the research design enables you to measure your variables of interest rigorously. All aspects of your research may pose threats to internal validity, such as equipment malfunction, participants who talk during the experiment, or measures with low internal consistency (see Chapters 2 and 5). Low power is another threat to internal validity.
Four factors are generally recognized as affecting the power of a study. In discussing these, David Howell (2013, p. 232) listed (1) the designated alpha level, (2) the true alternative hypothesis (essentially, how large the difference between H₀ and H₁ is), (3) the sample size, and (4) the specific statistical test to be used. In his view, sample size rises to the top as the easiest way to control the power of your study.
Alternative hypothesis: The hypothesis that you have stated will be true if the null hypothesis is rejected.
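To illustrate how these factors trade off, here is a minimal sketch in Python using the statsmodels library (not referenced in this chapter). The inputs are conventional assumed values rather than figures from the text: a medium effect size of d = .5 (Cohen, 1988; effect size d is discussed in the next paragraph), the conventional two-tailed alpha of .05, and the commonly recommended power of .80.

```python
# A minimal sketch (Python with statsmodels); the effect size, alpha, and
# desired power are assumed conventional values, not figures from the text.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Solve for the sample size per group needed to reach power of .80,
# given a medium effect (d = .5) and a two-tailed alpha of .05.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05,
                                   power=0.80, alternative="two-sided")
print(f"Participants needed per group: {n_per_group:.0f}")  # about 64

# Sample size is the easiest factor to control: with only 20 per group,
# power drops well below .80 for the same effect size and alpha.
power_at_20 = analysis.solve_power(effect_size=0.5, nobs1=20, alpha=0.05,
                                   alternative="two-sided")
print(f"Power with 20 per group: {power_at_20:.2f}")  # about .34
```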
Power is associated with effect size (as defined in Chapter 2; Cohen, 1988), which is discussed next. Effect size describes what its label suggests: whether an intervention of interest has an impact or effect. Consider two means of interest (when H₀ is true and when H₀ is false) and the sampling distributions of the populations from which they were drawn. How much do they overlap? If they are far apart and there is little overlap, you have a large effect size; if they are close together and there is a lot of overlap, you have a small effect size. Effect size is indicated by d and represents the difference between the means in standard deviation units. Statistical programs generally have an option for providing