The results of norm-referenced testing, therefore, depend greatly on the comparison of the individual child with the normative sample. At a minimum, comparisons are made based on children’s age. For example, on measures of intelligence, 9-year-old children must be compared to other 9-year-old children, not to 6-year-old children or to 12-year-old children. On other psychological tests, especially tests of behavior and social–emotional functioning, comparisons are made based on age and gender. For example, boys tend to show more symptoms of hyperactivity than do girls. Consequently, when a clinician obtains parents’ ratings of hyperactivity for a 9-year-old boy, he compares these ratings to those of other 9-year-old boys in the normative sample (Achenbach, 2015).
Usually, clinicians want to quantify the degree to which children score above or below the mean for the normative sample. To do so, they transform the child’s raw test score into a standard score. A standard score is simply a raw score that has been converted to a different scale with a designated mean and standard deviation. For example, intelligence tests have a mean of 100 and a standard deviation of 15. A child with a Full Scale IQ (FSIQ) of 100 would fall squarely within the average range compared to other children his age, whereas a child with an FSIQ of 115 would score one standard deviation above the mean and be considered above average.
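To make this conversion concrete, here is a brief sketch of the arithmetic in Python. The normative mean and standard deviation below are hypothetical values chosen for illustration, not figures from any actual test manual.

```python
# A minimal sketch of the raw-to-standard-score conversion, assuming a
# hypothetical normative sample with a known mean and standard deviation.
def standard_score(raw, norm_mean, norm_sd, scale_mean=100, scale_sd=15):
    """Convert a raw score to a standard score (default scale: M = 100, SD = 15)."""
    z = (raw - norm_mean) / norm_sd    # deviation from the mean in SD units
    return scale_mean + scale_sd * z   # rescale to the designated metric

# Hypothetical example: a raw score of 42 in a normative group with a mean of
# 35 and an SD of 7 lies one standard deviation above the mean.
print(standard_score(42, norm_mean=35, norm_sd=7))  # 115.0
```

The same two-step logic, computing a z score and then rescaling it, underlies the standard scores reported by most norm-referenced tests.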
Reliability
Reliability refers to the consistency of a psychological test. Reliable tests yield consistent scores over time and across administrations. Although there are many types of reliability, the three most common are test–retest reliability, inter-rater reliability, and internal consistency (Hogan & Tsushima, 2018).
Test–retest reliability refers to the consistency of test scores over time. Imagine that you purchase a Fitbit to help you get into shape. You wear the Fitbit each morning while walking to your first class. If the number of steps estimated by the Fitbit is approximately the same each day, we would say that the Fitbit shows high test–retest reliability. The device yields consistent scores across repeated administrations. Psychological tests should also have high test–retest reliability. A child who earns an FSIQ of 110 should earn a similar FSIQ score several months later.
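In practice, test–retest reliability is typically quantified as the correlation between scores from two administrations of the same test. The Python sketch below computes a Pearson correlation for a handful of hypothetical FSIQ scores; the numbers are invented for illustration.

```python
# A minimal sketch of test–retest reliability as a Pearson correlation
# between two administrations of the same test (all scores hypothetical).
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two paired lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

time1 = [110, 95, 102, 88, 120]  # FSIQ scores at the first administration
time2 = [112, 93, 105, 90, 118]  # the same children several months later
print(round(pearson_r(time1, time2), 2))  # 0.98 -> high test–retest reliability
```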
Inter-rater reliability refers to the consistency of test scores across two or more raters or observers. Imagine that you are affluent enough to own a Fitbit and a Garmin to measure your daily activity, one on each wrist. If the number of steps were similar for each device, we would say that the devices show excellent inter-rater reliability; they agree with each other. Similarly, psychological tests should show high inter-rater reliability. For example, on portions of the WISC–V, psychologists assign points based on the thoroughness of children’s answers. If a child defines an elephant as an animal, she might earn 1 point, whereas if she defines it as an animal with four legs, a trunk, and large ears, she might earn 2 points. Different psychologists should assign the same points for the same response, showing high inter-rater reliability.
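Because two raters can agree by chance alone, inter-rater reliability is often summarized with a chance-corrected index such as Cohen’s kappa. The sketch below computes kappa for two hypothetical scorers who assign 0, 1, or 2 points to the same eight responses; the scores are invented for illustration.

```python
# A minimal sketch of inter-rater agreement using Cohen's kappa, which
# corrects raw percent agreement for agreement expected by chance alone.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' score lists."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: the probability that both raters assign the same
    # score if each scored at random according to their own base rates.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a | freq_b)
    return (observed - expected) / (1 - expected)

scorer_1 = [2, 1, 0, 2, 1, 1, 2, 0]  # points assigned to eight responses
scorer_2 = [2, 1, 0, 2, 1, 2, 2, 0]  # a second psychologist, same responses
print(round(cohens_kappa(scorer_1, scorer_2), 2))  # 0.81 -> strong agreement
```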
Internal consistency refers to the degree to which test items yield consistent scores. Imagine that you want to obtain an estimate of your physical activity using your Fitbit. You decide to measure activity in three ways: (1) using the Fitbit’s step count, (2) using its GPS data, and (3) manually recording your activity. If you exercise a lot that day, all three scores should be high, because they all measure the same construct (i.e., activity). On the other hand, if you are sedentary that day, all three scores should be low. Such data would indicate good internal consistency; items measuring the same construct should yield consistent results.
Psychological tests should also have high internal consistency. For example, the WISC–V verbal comprehension tests show very high internal consistency. Children with excellent verbal skills tend to answer most test items correctly, whereas children with lower verbal skills tend to struggle on these items. High internal consistency suggests that items on the verbal comprehension index measure the same construct (i.e., verbal comprehension) and not other constructs such as the child’s visual–spatial skills or memory.
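Internal consistency is commonly summarized with Cronbach’s alpha, which compares the variability of individual items to the variability of children’s total scores. The sketch below applies the standard alpha formula to a small, hypothetical matrix of item scores; the data are invented for illustration and are not actual WISC–V responses.

```python
# A minimal sketch of internal consistency via Cronbach's alpha:
# alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)
from statistics import variance

# Rows = children, columns = items (1 = correct, 0 = incorrect); hypothetical.
scores = [
    [1, 1, 1, 1],  # strong verbal skills: most items correct
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],  # weaker verbal skills: most items incorrect
]
k = len(scores[0])          # number of items
items = list(zip(*scores))  # transpose: one tuple of scores per item
item_vars = sum(variance(item) for item in items)
total_var = variance([sum(child) for child in scores])
alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(round(alpha, 2))      # 0.8 -> high internal consistency
```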
Reliability can be quantified using a coefficient ranging from 0 to 1.0. A reliability coefficient of 1.0 indicates perfect consistency, whereas a coefficient near 0 indicates little or no consistency. What constitutes “acceptable” reliability varies depending on the type of reliability and the construct the test is measuring. For example, tests that assess traits that are believed to be stable over time, such as FSIQ, should have high test–retest reliability. In contrast, tests that measure mental states that are likely to change over time, such as symptoms of depression or anxiety, may have lower test–retest reliability.
Validity
Validity refers to the degree to which a test accurately measures what it is designed to measure. Tests of intelligence should measure intelligence, tests of depression should measure depression, and so forth (Hogan & Tsushima, 2018).
Whereas reliability reflects consistency, validity refers to accuracy. Imagine a poor archer whose arrows are scattered across a target in a random pattern. The poor archer shows low reliability (as evidenced by his scattered arrows) and low validity (because he missed the bullseye). Now imagine a better archer whose five arrows are clustered together but distant from the bullseye. The better archer displays high reliability (as evidenced by his consistent pattern) but low validity (because he also missed the bullseye). Finally, imagine Katniss Everdeen, whose arrows are clustered together within the bullseye. Katniss shows high reliability and validity, like an ideal psychological test (Figure 4.5).
Figure 4.5 ■ Reliability and Validity
Note: Reliability refers to the consistency of test scores; validity refers to their accuracy. Psychological tests must yield consistent scores that accurately measure the construct they are designed to measure.
Technically speaking, validity is not a property of a test itself. Instead, validity refers to the degree to which the test can be accurately used to serve a specific purpose (Hogan & Tsushima, 2018). Imagine that you record your daily physical activity with your Fitbit for 1 month. Your daily activity level is very high (10,000 steps) and relatively consistent (i.e., reliable) each day. Consequently, you might conclude that you are in excellent health. However, when you visit your doctor, she tells you that you have high blood pressure and cholesterol. The Fitbit may be a reliable measure of activity, but it may not be a valid measure of overall health. Similarly, a child’s WISC–V visual–spatial reasoning score may provide an accurate estimate of her visual–spatial reasoning, but it is probably not an accurate indicator of her reading or writing skills.
The validity of psychological tests can be examined in at least three ways. First, psychologists can look at the content validity of the test. Specifically, the content of test items should be relevant to the test’s purpose. For example, the Children’s Depression Inventory, Second Edition (CDI-2; Kovacs, 2011) is the most widely used instrument to assess depression in children. The test includes items that reflect many of the diagnostic criteria for depression: “I am sad all the time,” “I am cranky all the time,” and “I have trouble sleeping every night.” The CDI-2 has excellent content validity because these items are consistent with the DSM-5 symptoms of depression.
Psychologists also examine the construct validity of the test (Cronbach & Meehl, 1955). Construct validity refers to the degree to which test scores reflect hypothesized attributes, or constructs. Most psychological variables are constructs: intelligence, depression, anxiety, aggression. Constructs cannot be measured directly; instead, they must be inferred from overt actions or people’s self-reports. For example, intelligence might be inferred from excellent grades in school, depression might be inferred from frequent crying, and aggression might be inferred from a history of physical fighting.
To investigate the construct validity of a test, psychologists examine the relationship of test scores to other measures of similar and dissimilar constructs. Evidence of convergent