Measurement of a speaking performance, however, requires a different kind of scale, such as those used in certain sports competitions (see Spolsky, 1995) where the quality of performance is based on rank. There is no equal‐interval unit of measurement comparable to ounces or pounds that allows the precise measurement of a figure skating performance. Likewise, assessing speaking ranks students into ordinal categories (often referred to as vertical categories) similar to bronze, silver, and gold; A2, B2, C2; or beginning, intermediate, and advanced.
The global assessment of performance is associated with holistic scales, where abilities are represented by level descriptors comprising a qualitative summary of raters' observations. Benchmark performances are selected to exemplify the levels and their descriptors. Scale descriptors are typically associated with, but not limited to: pronunciation (focusing on segmentals); phonological control (focusing on suprasegmentals); grammar/accuracy (morphology, syntax, and usage); fluency (speed and pausing); vocabulary (range and idiomaticity); coherence; and organization. If the assessment involves evaluation of interaction, the following may also be included: turn‐taking strategies, cooperative strategies, and asking for or providing clarification.
Holistic vertical indicators, even when accompanied by scale descriptors and benchmarks, may not be sufficient for making instructional or placement decisions. In such cases, an analytic rating, discourse analysis, or extraction of temporal measures of fluency may be conducted to explicate components of the examinee's performance. The specific components chosen, which can include any of the same aspects of performance used in holistic scale descriptors, depend on the purpose of the test, the needs of the score users, and the interests of the researcher.
Score users are central in Alderson's (1991) distinction among three types of scales: constructor oriented, assessor oriented, and user oriented. The language used to describe abilities tends to focus on the positive aspects of performance in user‐oriented and constructor‐oriented scales: the former may focus on likely behaviors at a given level, and the latter on particular tasks associated with a curriculum or course of instruction. Assessor‐oriented scales shift the focus from the learner and the objectives of learning toward the rater; their descriptors are often negatively worded, reflect the rater's perceptions, and tend to be more useful for screening purposes.
From another perspective, scales for speaking assessments can be theoretically oriented, empirically oriented, or both. The starting point for all assessments is usually a description or theory of language ability (e.g., Canale & Swain, 1980; Bachman, 1990). These broad orientations are then narrowed down to focus on a particular skill and particular components of that skill. Empirical approaches to the development and validation of speaking assessment scales involve identification of characteristics of interest for the subsequent development of scale levels (e.g., Chalhoub‐Deville, 1995; Fulcher, 1996) or explications of assigned ability levels (e.g., Xi & Mollaun, 2006; Iwashita, Brown, McNamara, & O'Hagan, 2008; Ginther, Dimova, & Yang, 2010), or both. In addition, collecting data about examinee perspectives and experiences can be used to improve test development and administrative procedures (Yan, Thirakunkovit, Kauper, & Ginther, 2016).
Specific‐purpose scales are often derived from general guidelines and frameworks. For example, the ACTFL proficiency guidelines (2009) serve as a starting point for the ACTFL OPI scale. Another influential framework is the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001). The CEFR is a collection of descriptions of language ability, ranging from beginning to advanced, across and within the four main skills. The document is comprehensive and formidable in scope; in spite of its breadth, however, the CEFR has been used to construct scales for assessing language performance, to communicate about levels locally and nationally (Figueras & Noijons, 2009), and to interpret test scores (Tannenbaum & Wylie, 2008).
Raters
In order for raters to achieve a common understanding and application of a scale, training is necessary. As the standard for speaking assessment procedures involving high‐stakes decisions is an inter‐rater reliability coefficient of 0.80, some variability among raters is expected and tolerated. Under optimal conditions, the sources of error that can be associated with the use of a scale are expected to be random rather than systematic.
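The 0.80 criterion mentioned above is commonly operationalized as a correlation between the scores that two raters assign to the same set of performances. As a minimal sketch (not drawn from this chapter; the scores and the Pearson operationalization are illustrative assumptions, since reliability can also be estimated with kappa or generalizability coefficients), the computation looks like this:

```python
from statistics import mean, pstdev

def interrater_reliability(rater_a, rater_b):
    """Pearson correlation between two raters' scores on the same
    performances -- one common operationalization of inter-rater
    reliability for a holistic speaking scale."""
    ma, mb = mean(rater_a), mean(rater_b)
    # Population covariance of the two raters' scores
    cov = mean((a - ma) * (b - mb) for a, b in zip(rater_a, rater_b))
    return cov / (pstdev(rater_a) * pstdev(rater_b))

# Hypothetical holistic scores (1-5 scale) assigned by two trained raters
a = [5, 4, 3, 4, 2, 5, 3, 1]
b = [4, 4, 3, 5, 2, 5, 2, 1]

r = interrater_reliability(a, b)
print(round(r, 2))
```

With these illustrative scores the coefficient exceeds 0.80, so the pair of raters would meet the high-stakes criterion; raters agree on relative rank even though they occasionally differ by a scale point.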
One type of systematic error results from a rater's tendency to assign either harsh or lenient scores. When a pattern is identified in comparison to other raters in a pool, a rater may be identified as negatively or positively biased. Systematic effects with respect to score assignment have been found in association with rater experience and native language background, and also examinee native language background (Ross, 1979; Brown, 1995; Chalhoub‐Deville, 1995; Chalhoub‐Deville & Wigglesworth, 2005; Winke, Gass, & Myford, 2011; Yan, 2014; Yan, Cheng, & Ginther, 2019). Every effort should be made to identify and remove biased raters, as their presence negatively affects the accuracy, utility, interpretability, and fairness of the scores we report (see Wind & Peterson, 2018).
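One simple screen for the harshness or leniency described above is to compare each rater's mean assigned score against the mean of the whole rater pool. The sketch below is illustrative only (the function name, the 0.5-point threshold, and the scores are assumptions, not a procedure described in this chapter; operational programs typically use many-facet Rasch analysis for this purpose):

```python
from statistics import mean

def flag_severity(rater_scores, threshold=0.5):
    """Flag raters whose mean assigned score deviates from the pool
    mean by more than `threshold` scale points -- a crude screen for
    systematic harshness (negative deviation) or leniency (positive).
    Returns {rater: deviation} for flagged raters only."""
    pool_mean = mean(s for scores in rater_scores.values() for s in scores)
    return {rater: round(mean(scores) - pool_mean, 2)
            for rater, scores in rater_scores.items()
            if abs(mean(scores) - pool_mean) > threshold}

# Hypothetical scores on a 1-5 holistic scale
scores = {
    "R1": [3, 4, 3, 4, 3],   # near the pool mean: not flagged
    "R2": [2, 2, 3, 2, 2],   # consistently harsh
    "R3": [4, 5, 4, 5, 5],   # consistently lenient
}
print(flag_severity(scores))  # flags R2 (harsh) and R3 (lenient)
```

Note that this screen assumes raters saw comparable samples of examinees; when assignments are unbalanced, a harsh-looking mean may simply reflect a weaker group of examinees, which is why operational analyses model rater and examinee facets jointly.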
While these findings underscore the importance of rater training (see Yan & Ginther, 2017), its positive effects may be short‐lived (Lumley & McNamara, 1995). Raters drift over time, so the practice of certifying raters once and for all is problematic. The most effective rater training procedures include calibration and regular retraining sessions.
A more frequent concern raised by studies of rater variability—one that can only be partially addressed by rater training—is whose standard is the most appropriate to apply when developing assessments and scales. Ginther and McIntosh (2018) summarize:
Work on World Englishes (WE)… has challenged the notion of an ideal native‐speaker, long promoted in theoretical and applied linguistics, and helped to legitimize varieties other than standard British or American English. Meanwhile, English as a lingua franca (ELF) scholars like Seidlhofer (2001) and Jenkins (2006) have advocated for a more flexible contact language that could serve as a communicative resource for so‐called “non‐native” and “native” speakers alike. Both traditions have criticized language tests, especially large‐scale ones like TOEFL, that continue to use native English speaker (NES) norms as the basis for items and assessment, despite the fact that non‐native English speakers (NNES) are now the majority (Davidson, 2006). (p. 860)
Dimova (2017) discusses the implications for the inclusion of a broader variety of speakers in relation to both ELF and WE. Ockey, Papageorgiou, and French (2015) and Ockey and French (2016) discuss performance effects on listeners and speakers. Harding (2017) argues, in a review of validity concerns for speaking assessments, that the time has come for listener variables to be considered in construct definitions.
Conclusion
Each successive expansion of the construct of language proficiency has added to its richness in terms of description and explication (Ginther & McIntosh, 2018). This capacity to grow and adapt points toward a bright future for the field of language testing, and especially for speaking assessment.
SEE ALSO: Assessment of Listening; Assessment of Reading; Assessment of Vocabulary; Assessment of Writing; Automatic Speech Recognition; Intelligibility in World Englishes; Rating Oral Language; Rating Scales and Rubrics in Language Assessment
References
1 Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language testing in the 1990s: The communicative legacy (pp. 71–86). London, England: Macmillan.
2 American Council on the Teaching of Foreign Languages (ACTFL). (2009). Testing for proficiency: The ACTFL oral proficiency