where the 3-gram probabilities are estimated from the 3-gram frequencies observed in the k-grade documents, with smoothing techniques to account for unobserved events. Given the probabilities of a sequence w under the different models (one per grade), a likelihood ratio of sequence w is defined as:
$$LR(w, k) = \frac{p(w \mid k)\, p(k)}{\sum_{k' \neq k} p(w \mid k')\, p(k')}$$
where the prior probabilities p(k) can be assumed to be uniform. The LR(w, k) values already give some information on the likelihood of the text being of a certain complexity or grade. Additionally, the authors use perplexity as an indicator of how well a particular text fits a given model: low perplexity for a text t under a model m indicates a better fit of t to m. Worth noting is the reduction of the features of the language models through feature filtering by information gain (IG) values, which leaves 276 words (the most discriminative) and 56 part-of-speech tags (for words not selected by IG).
SVMs are trained using the graded dataset (Weekly Reader), where each text is represented as a set of features including traditional superficial readability features such as average sentence length, average number of syllables per word, and the Flesch-Kincaid index, together with more sophisticated features such as syntax-based features, vocabulary features, and language model features. Syntax-based features are extracted from parsed sentences [Charniak, 2000] and include average parse tree height, average number of noun phrases, average number of verb phrases, and average number of clauses (SBARs in the Penn Treebank tag set). Vocabulary features account for out-of-vocabulary (OOV) word occurrences in the text. These are computed as the percentages of words or word types not found among the 100, 200, and 500 most common words occurring in 2nd-grade texts. Concerning language model features, there are 12 perplexity values, one for each of 12 language models obtained by combining the paired datasets Britannica/CNN (adults vs. children) with three different n-gram orders: unigrams, bigrams, and trigrams (combining discriminative words and POS tags).
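As an illustration, the likelihood ratio and the perplexity indicator can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the ratio takes the form p(w|k)p(k) divided by the sum of p(w|j)p(j) over the other grades j, that each grade model has already produced a natural-log probability for the text, and that priors are uniform unless given. The function names `likelihood_ratio` and `perplexity` are hypothetical.

```python
import math


def likelihood_ratio(log_probs, k, priors=None):
    """Assumed form: LR(w, k) = p(w|k) p(k) / sum_{j != k} p(w|j) p(j).

    log_probs maps each grade to the natural-log probability of the
    sequence w under that grade's language model (computed elsewhere).
    """
    grades = list(log_probs)
    if priors is None:
        # Uniform priors, as the text says can be assumed.
        priors = {g: 1.0 / len(grades) for g in grades}
    # Subtract the largest log-probability before exponentiating; the
    # common factor cancels in the ratio and limits underflow.
    shift = max(log_probs.values())
    weighted = {g: math.exp(log_probs[g] - shift) * priors[g] for g in grades}
    denominator = sum(v for g, v in weighted.items() if g != k)
    return weighted[k] / denominator


def perplexity(log_prob, n_tokens):
    """Perplexity of a text from its total natural-log probability."""
    return math.exp(-log_prob / n_tokens)
```

A ratio above 1 means the grade-k model explains the text better than all the other grade models combined, while a lower perplexity under a model indicates a better fit of the text to that model.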
The authors obtained better results in comparison to traditional readability formulas when their language model features are used in combination with vocabulary features, syntax-based features, and superficial indicators. Petersen and Ostendorf [2007] extend the previous work by considering additional non-graded data from newspaper articles to represent higher grade levels (more useful for classification than for regression).
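The vocabulary (OOV) features described above lend themselves to a short sketch. The code below is illustrative only: the function name `oov_features`, the feature names, and the assumption that the reference list is built by ranking tokens of a 2nd-grade corpus by frequency are hypothetical choices, and tokenization is assumed to have been done elsewhere.

```python
from collections import Counter


def oov_features(text_tokens, reference_tokens, cutoffs=(100, 200, 500)):
    """Percentages of tokens and types missing from the top-n reference words."""
    if not text_tokens:
        raise ValueError("text_tokens must not be empty")
    # Rank reference words (e.g., from 2nd-grade texts) by frequency.
    ranked = [word for word, _ in Counter(reference_tokens).most_common()]
    types = set(text_tokens)
    features = {}
    for n in cutoffs:
        common = set(ranked[:n])
        oov_tokens = sum(1 for w in text_tokens if w not in common)
        oov_types = sum(1 for w in types if w not in common)
        features[f"oov_token_pct_{n}"] = 100.0 * oov_tokens / len(text_tokens)
        features[f"oov_type_pct_{n}"] = 100.0 * oov_types / len(types)
    return features
```

With the default cutoffs this yields six values per text, which can then be appended to the rest of the feature vector.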
2.3.3 DISCOURSE, SEMANTICS, AND COHESION IN ASSESSING READABILITY
Feng et al. [2009] are especially interested in readability for individuals with mild-level intellectual disabilities (MID) (e.g., intelligence quotient (IQ) in the 55–70 range) and in how to select appropriate reading material for this population. The authors note that people with MID differ from adults with low literacy in that the former have problems with working memory and with discourse representation, which complicates the processes of recalling information and drawing inferences as they read a text. The authors argue that appropriate readability assessment tools which take into account the specific issues of these users should therefore be designed. Their main research hypothesis is that the number of entity mentions in a text is related to readability issues for people with MID; accordingly, they design a series of features accounting for entity density. As for data on this specific population, they have created a small (20 documents in original and simplified versions) but rather unique ID dataset for testing their readability prediction model. The dataset is composed of news documents with aggregated readability scores based on the number of correct answers to multiple-choice questions that 14 individuals with MID gave after reading the texts. In order to train a model, they rely on the availability of paired and generic graded corpora. The paired dataset (not graded) is composed of original articles from Encyclopedia Britannica written for adults together with their adapted versions for children, and of CNN news stories from the LiteracyNet organization available in original and abridged (or simplified) versions. The graded dataset is composed of articles for students in grades 2–5.
Where the model’s features are concerned, although many of the features studied were already available in previous work (in the same or a similar form), novel features take into account the number and the density of entity mentions (i.e., nouns and named entities), the number of lexical chains in the text, average lexical chain length, etc. These features are assessed on the paired datasets so as to identify their discriminative power, leaving all but two features outside the model. Three rich readability prediction models (corresponding to basic features, cognitively motivated features, and the union of all features) are then trained on the graded dataset (80% of the dataset) using a linear regression algorithm (unlike the SVM-based approach described above). Evaluation is carried out on the remaining 20% of the dataset, showing a considerable reduction in error (the difference between predicted and gold grade) for the models when compared with a baseline readability formula (the Flesch-Kincaid index [Kincaid et al.]). The final user-specific evaluation is conducted on the ID corpus, where the model is evaluated by computing the correlation between system output and the human readability scores associated with the texts.
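The two kinds of evaluation mentioned above can be sketched as follows. The source does not say exactly which error measure or correlation coefficient was used, so the choices here are assumptions made for illustration: mean absolute error for the difference between predicted and gold grades, relative error reduction against a baseline, and Pearson's coefficient for the correlation. All function names are hypothetical.

```python
import math


def mean_absolute_error(predicted, gold):
    """Average absolute difference between predicted and gold grades."""
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)


def error_reduction(model_error, baseline_error):
    """Relative reduction of the model's error with respect to a baseline."""
    return (baseline_error - model_error) / baseline_error


def pearson(xs, ys):
    """Pearson correlation between system scores and human scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)
```

In such a setup, the first two functions would be applied to the held-out 20% of the graded data, and the correlation to the scores collected for the ID corpus.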
Feng et al. [2010] extended the previous work by incorporating additional features (e.g., language model features and out-of-vocabulary features from Schwarm and Ostendorf [2005] and entity coreference and coherence-based features based on those of Barzilay and Lapata [2008] and Pitler and Nenkova [2008]), assessing the performance of each group of features, and comparing their model to state-of-the-art competing approaches (i.e., mainly replicating the models of Schwarm and Ostendorf [2005]). Experimental results using SVMs and logistic regression classifiers show that, although accuracy is still limited (around 74% with SVMs and selected features), important gains are obtained from the use of more sophisticated, linguistically motivated features.
Heilman et al. [2007] are interested in the effect of pedagogically motivated features in the development of readability assessment tools, especially in the case of texts for second language (L2) learners. More specifically, they suggest that since L2 learners acquire the lexicon and grammar of the target language from exposure to material specifically chosen for the acquisition process, both lexicon and grammar should play a role in assessing the reading difficulty of the L2 learning material. In terms of lexicon, a unigram language model is proposed for each grade level so as to assess the likelihood of a given text under a given grade (see Section 2.3.1 for a similar approach). Where syntactic information is concerned, two different sets of features are proposed: (i) a set of 22 grammatical constructions (e.g., passive voice, relative clause) identified in sentences after they have been parsed by the Stanford Parser [Klein and Manning, 2003], which produces syntactic constituent structures; and (ii) 12 grammatical features (e.g., sentence length, verb tenses, part-of-speech tags) which can be identified without the need for a syntactic parser. All feature values are numerical, indicating the number of times the particular feature occurred per word in the text (note that other works take averages on a per-sentence basis). Texts represented as vectors of features and values are used in a k-Nearest Neighbor (kNN) algorithm (see Mitchell [1997]) to predict the readability grade of unseen texts: a given text t is compared (using a similarity measure) to all available vectors and the k closest texts are retrieved; the grade level of t is then the most frequent grade among the k retrieved texts. While the lexical model above will produce, for each text and grade, a probability, the confidence of the kNN prediction can be computed as the proportion of the k texts with the same class as the one predicted for text t.
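The kNN prediction step and its confidence can be sketched as below. This is a minimal illustration rather than the authors' system: the source only says that "a similarity measure" is used, so Euclidean distance is an assumption made here, and the names `per_word_rates` and `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def per_word_rates(counts, n_words):
    """Turn raw construction counts into per-word rates."""
    return {name: count / n_words for name, count in counts.items()}


def knn_predict(query, training, k=3):
    """Predict a grade for `query` and return (grade, confidence).

    training is a list of (feature_vector, grade) pairs. Euclidean
    distance is an assumption; any similarity measure could be swapped in.
    """
    neighbours = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(grade for _, grade in neighbours)
    grade, count = votes.most_common(1)[0]
    # Confidence: share of retrieved texts carrying the predicted grade.
    return grade, count / len(neighbours)
```

For example, if two of the three nearest texts are grade 1 and the third is grade 2, the function returns grade 1 with a confidence of 2/3.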
The probability given by the language model and the kNN confidence can be interpolated into a single confidence score, yielding a joint grade prediction model. In order to evaluate the different individual models and their combinations, the authors use one dataset for L1 learners (a web corpus [Collins-Thompson and Callan, 2004]) and a second dataset for L2 learners (collected from several sources). Prediction performance is measured using correlation and mean squared error (MSE), since the authors argue that regression is a more appropriate way to frame readability assessment. Overall, although the lexical model in isolation is superior to the two grammatical models (in both datasets), their combination shows significant advantages. Moreover, although the complex syntactic features have better predictive power than the simple syntactic features, their slight difference in performance may justify