In examining the development of writing ability, Wolfe‐Quintero et al. (1998) provided a comprehensive list of measures for analyzing written production. They defined complexity as “the use of varied and sophisticated structures and vocabulary” (p. 117) in production units. Grammatical complexity involved linguistic features (e.g., subordination) in clause, T‐unit, and sentence production units, measured by the frequency of occurrence of these features (e.g., clauses), by complexity ratios (e.g., the T‐unit complexity ratio), and by complexity indices (e.g., coordination index). Lexical complexity was measured by lexical ratios (e.g., word variation/density). Accuracy referred to “error‐free production” (p. 117), measured by the frequency of occurrence of errors or error types within a production unit (e.g., error‐free T‐units), by ratios (e.g., error‐free T‐units per T‐unit), or by accuracy indices (e.g., error index). Finally, they defined fluency as “the rapid production of language” (p. 117), measured by the frequency of occurrence of specified production units (e.g., words in error‐free clauses) or by fluency ratios (e.g., words per T‐unit or clause).
In a comprehensive book on the analysis of learner data, Ellis and Barkhuizen (2005) presented, in one chapter, an updated list of analytic measures of CAF (complexity, accuracy, and fluency). Complexity was characterized as interactional (e.g., number of turns, mean turn length), propositional (e.g., number of idea units), functional (e.g., frequency of language functions), and grammatical (e.g., amount of subordination), whereas accuracy was measured by the number of errors per 100 words, the percentage of error‐free clauses, and so forth. Finally, fluency was characterized in terms of temporal variations (e.g., number of pauses) and hesitation phenomena (e.g., number of false starts).
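The ratio measures described above reduce to simple arithmetic once a text has been hand-coded for words, clauses, T‐units, and errors. The following is a minimal sketch in Python, not a procedure taken from either source. The class `TextCounts`, the function `caf_ratios`, and the sample counts are hypothetical and serve only to illustrate how such ratios might be computed.

```python
from dataclasses import dataclass


@dataclass
class TextCounts:
    """Hand-coded counts for one learner text (all field names are hypothetical)."""
    words: int
    clauses: int
    t_units: int
    error_free_clauses: int
    error_free_t_units: int
    errors: int


def caf_ratios(c: TextCounts) -> dict[str, float]:
    """Return a few illustrative complexity, accuracy, and fluency ratios."""
    if min(c.words, c.clauses, c.t_units) <= 0:
        raise ValueError("words, clauses, and t_units must be positive")
    return {
        # Complexity: clauses per T-unit
        "clauses_per_t_unit": c.clauses / c.t_units,
        # Accuracy: error-free T-units per T-unit
        "error_free_t_units_per_t_unit": c.error_free_t_units / c.t_units,
        # Accuracy: percentage of error-free clauses
        "pct_error_free_clauses": 100 * c.error_free_clauses / c.clauses,
        # Accuracy: errors per 100 words
        "errors_per_100_words": 100 * c.errors / c.words,
        # Fluency: words per T-unit
        "words_per_t_unit": c.words / c.t_units,
    }


# Invented counts for a single 200-word text, for illustration only.
sample = TextCounts(words=200, clauses=30, t_units=20,
                    error_free_clauses=21, error_free_t_units=12, errors=14)
ratios = caf_ratios(sample)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

The counts themselves still depend on coding decisions (e.g., what counts as an error or a T‐unit), which the sketch takes as given.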
Several L2 testers have also used production features to characterize different aspects of written and spoken text. Cumming et al. (2006) analyzed the discourse features of independent and integrated written tasks for a large‐scale standardized test. They examined whether and how the discourse features of texts written for integrated tasks differed from those of independent essays. The texts were coded for lexical and syntactic complexity, grammatical accuracy, argument structure, orientation to evidence, and verbatim uses of source text (Figure 9). They found that the discourse features produced for the integrated writing tasks differed significantly at the lexical, syntactic, rhetorical, and pragmatic levels from those produced for the independent task. Also, significant differences were found among the three tasks across proficiency levels on several variables under investigation.
Knoch, Macqueen, and O'Hagan (2014) replicated and extended Cumming et al.'s (2006) study by examining, first, whether the written discourse features produced in independent and integrated writing tasks differed; and then, what features were typical of different scoring levels. The analysis focused on measures of accuracy, fluency, complexity, coherence, cohesion, content, orientation to source evidence, and metadiscourse (Figure 10). They found that the two types of tasks elicited significantly different language from the test takers and that the discourse features differed at the various score levels.
Figure 9 Discourse analytic measures (used in Cumming et al., 2006, used with permission)
Figure 10 Discourse analytic measures (used in Knoch et al., 2014, used with permission)
These measures provide important information on the characteristics of L2 production and could serve as evidence supporting validity claims (e.g., for extrapolation). However, it is unclear how production measures, such as the number of clauses per T‐unit or the average word length, can be interpreted within a model of L2 proficiency or how these measures can be used to provide actionable feedback for learners. That said, testing experts have effectively used production features to examine validity questions in the area of automated scoring and feedback systems (Chapelle, 2008). Several studies have, for example, correlated measures of production features with the extended, spontaneous production of written or spoken texts (Bernstein, Van Moere, & Cheng, 2010), or have examined the use of e‐rater production measures as a complement to human scoring of learners' essays (Enright & Quinlan, 2010). Weigle (2010) found consistent correlations between both human and e‐rater scores and non‐test indicators such as teacher evaluation or student self‐evaluation of writing. This work highlights the potential of using production features in automated scoring.
A third approach to measuring the linguistic resources of communication is through the use of developmental scores obtained from tasks that have been designed with developmental proficiency levels in mind (Purpura, 2004; Ellis, 2005), the levels being derived from research on the role of orders and sequences in interlanguage development in SLA. For example, to investigate the potential of assessing knowledge of grammatical form by means of developmental levels for a computer‐delivered ESL test of productive grammatical ability, Chapelle, Chung, Hegelheimer, Pendar, and Xu (2010) first identified a corpus of morphosyntactic (simple past) and syntactic (SVO word order) forms along with their associated meanings (past achievement or accomplishment) as they corresponded to multiple stages of interlanguage development. They then used this information to design highly constrained LP tasks (e.g., respond by using one or more of the following words) and somewhat constrained EP tasks (e.g., begin your response with the following words). Scoring of the responses was then based on a three‐point scale from no evidence (0) to partial evidence (1) to full evidence of knowledge (2) of the form. Results showed three distinct developmental levels of the items. Also, the developmental scores produced moderate correlations with the test takers' placement test writing scores and with their TOEFL iBT (Internet‐based Test of English as a Foreign Language) scores. While this study was well designed and the results remarkable, it focused only on the knowledge of semantico‐grammatical forms and meanings with no explicit measurement of how these forms served as a resource for communicating other meanings.
In another interesting study, Chang (2004) used developmental proficiency levels to design items that aimed to measure the difficulty order of English relative clauses in relation to the form, meaning, and pragmatic use dimensions. He found that when form alone was considered, the difficulty order of the relative clauses in his test generally matched that predicted in Keenan and Comrie's (1977) noun phrase accessibility hypothesis (NPAH), with some deviations. When the form and meaning of relative clauses were measured, the results in his test strongly supported the NPAH. However, when pragmatic use was considered, the difficulty order in Chang's test was different from that proposed in the NPAH. Interestingly, these results held when both developmental scores and accuracy scores were used. This study underscores the importance of moving beyond forms to a consideration of how semantico‐grammatical forms serve as a resource for communicating not only literal meanings, but also pragmatic meanings. Both studies also highlight the need to continue work with developmental scores.
Conclusion
In examining how the linguistic resources of L2 proficiency have been conceptualized and assessed in L2 assessment and SLA, this entry has argued that semantico‐grammatical knowledge defined uniquely in terms of forms and meanings presents only a partial understanding of the linguistic resources. A fuller definition of linguistic resources of communication needs to include the form‐meaning mappings that are intrinsically related to the conveyance of propositional and pragmatic meanings. It also argues that in the assessment of communicative effectiveness, the focus might be better placed on the use of grammatical ability for meaning conveyance, rather than