• lexico-semantic (vocabulary) features: relative word frequencies, type/token ratio, probabilistic language model measures such as text probability, perplexity, etc., and word maturity measures;
• psycholinguistic features: word age-of-acquisition, word concreteness, polysemy, etc.;
• syntactic features (designed to model sentence processing time): sentence length, parse tree height, etc.;
• discourse features (designed to model text’s cohesion and coherence): coreference chains, named entities, lexical tightness, etc.; and
• semantic and pragmatic features: use of idioms, cultural references, text type (opinion, satire, etc.), etc.
Collins-Thompson [2014] argues that in readability assessment the model used, that is, the features, seems to matter more than the machine learning approach chosen. In other words, a well-designed set of features can go a long way in readability assessment.
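As a minimal illustration (not tied to any particular system), the following Python sketch computes two of the simplest surface features mentioned above, average sentence length and type/token ratio, assuming naive punctuation-based sentence splitting and regex tokenization:

import re

def surface_features(text):
    # Naive sentence splitting on terminal punctuation and regex tokenization;
    # a full readability system would use proper NLP tooling instead.
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        'avg_sentence_length': len(tokens) / max(len(sentences), 1),
        'type_token_ratio': len(set(tokens)) / max(len(tokens), 1),
    }

print(surface_features("The cat sat on the mat. It was not unhappy."))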
2.2 READABILITY FORMULAS
DuBay [2004] points out that over 200 readability formulas existed by the 1980s. Many of them have been empirically tested to assess their predictive power, usually by correlating their outputs with the grade levels associated with sets of texts.
Two of the most widely used readability formulas are the Flesch Reading Ease Score [Flesch, 1949] and the Flesch-Kincaid readability formula [Kincaid et al., 1975]. The Flesch Reading Ease Score uses two text characteristics as proxies: the average sentence length (ASL) and the average number of syllables per word (ASW), which are combined in Formula (2.1):
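Reading Ease = 206.835 - (1.015 × ASL) - (84.6 × ASW)    (2.1)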
On a given text, the score will produce a value between 1 and 100, where the higher the value, the easier the text is to read. Documents scoring around 30 are very difficult to read, while those scoring around 70 should be easy to read.
The Flesch-Kincaid readability formula (2.2) simplifies the Flesch score to produce a “grade level,” which is easily interpretable (i.e., a text with a grade level of eight according to the formula can be considered appropriate for an eighth grader).
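Grade Level = (0.39 × ASL) + (11.8 × ASW) - 15.59    (2.2)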
Additional formulas used include the FOG readability score [Gunning, 1952] and the SMOG readability score [McLaughlin, 1969]. They are computed using the following equations:
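FOG = 0.4 × (ASL + HW)

SMOG = 3 + √PSC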
where HW is the percentage of “hard” words in the document (a hard word is one with at least three syllables) and PSC is the polysyllable count, i.e., the number of words with three or more syllables in a sample of 30 sentences taken from the beginning, middle, and end of the document.
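For illustration, the following sketch computes the Flesch Reading Ease, Flesch-Kincaid, and FOG scores defined above, assuming a crude vowel-group heuristic for counting syllables (production implementations rely on syllabification dictionaries or pronunciation lexicons):

import re

def syllables(word):
    # Rough heuristic: count vowel groups; real systems use pronunciation lexicons.
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def classic_scores(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)
    asl = len(words) / max(len(sentences), 1)                # average sentence length
    asw = sum(syllables(w) for w in words) / n               # average syllables per word
    hw = 100.0 * sum(syllables(w) >= 3 for w in words) / n   # percentage of "hard" words
    return {
        'flesch_reading_ease': 206.835 - 1.015 * asl - 84.6 * asw,  # Formula (2.1)
        'flesch_kincaid_grade': 0.39 * asl + 11.8 * asw - 15.59,    # Formula (2.2)
        'fog': 0.4 * (asl + hw),
    }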
Work on readability assessment has also included the idea of using a vocabulary or word list which may contain words together with indications of the age at which particular words should be known [Dale and Chall, 1948a]. These lists are useful to verify whether a given text deviates from what should be known at a particular age or grade level, constituting a rudimentary form of readability language model.
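As a minimal sketch, assuming a hypothetical dictionary graded_words that maps each word to the earliest grade at which it should be known (a Dale-Chall-style resource), the proportion of tokens in a text that fall outside the expected vocabulary for a target grade can be estimated as follows:

def out_of_vocabulary_rate(tokens, graded_words, target_grade):
    # graded_words is a hypothetical dict mapping each word to the earliest
    # grade at which it should be known (a Dale-Chall-style resource).
    unknown = [w for w in tokens
               if graded_words.get(w.lower(), float('inf')) > target_grade]
    return len(unknown) / max(len(tokens), 1)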
Readability measures have begun to take center stage in assessing the output of text simplification systems; however, their direct applicability is not without controversy. First, a number of recent studies have applied classical readability formulas to individual sentences [Wubben et al., 2012, Zhu et al., 2010], even though such formulas were designed for, and validated on, sizable text samples and generally need long passages to yield reliable estimates; their applicability at the sentence level therefore still lacks empirical justification and needs to be re-examined. Second, a number of studies suggest using readability formulas to guide the simplification process (e.g., De Belder [2014], Woodsend and Lapata [2011]). However, manipulating a text to match a specific readability score can be problematic: chopping sentences or blindly replacing words can produce completely ungrammatical texts that nonetheless “cheat” the readability formulas (see, for example, Bruce et al. [1981], Davison et al. [1980]).
2.3 ADVANCED NATURAL LANGUAGE PROCESSING FOR READABILITY ASSESSMENT
Over the last decade, traditional readability formulas have come under criticism [Feng et al., 2009]. Advances in natural language processing have made possible a whole new set of studies on readability. Current natural language processing work on readability assessment relies on automatic parsing, the availability of psycholinguistic information, and language modeling techniques [Manning et al., 2008] to develop more robust methods. Today it is possible to extract rich syntactic and semantic features from a text in order to analyze and understand how they interact to make the text more or less readable.
2.3.1 LANGUAGE MODELS
Various works have considered corpus-based statistical methods for readability assessment. Si and Callan [2001] cast text readability assessment as a text classification or categorization problem where the classes could be grades or text difficulty levels. Instead of considering just surface linguistic features, they argue, quite naturally, that the content of the document is a key factor contributing to its readability. After observing that some surface features such as syllable count were not useful predictors of grade level in the dataset adopted (syllabi of elementary and middle school science courses of various readability levels from the Web), they combined a unigram language model with a sentence-length language model in the following approach:
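P(g | d) = λ × Pa(g | d) + (1 - λ) × Pb(g | d)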
where g is a grade level, d is the document, Pa is a unigram language model, Pb is a sentence-length distribution model, and λ is a coefficient adjusted to yield optimal performance. Note that the events modeled by Pa are words, that is, the document is seen as d = w1 … wn, with wl the word at position l in the document, whereas the events modeled by Pb are sentence lengths, so a document with k sentences is seen as d = l1 … lk, with li the length of the i-th sentence. The distribution Pa is a unigram model computed in the usual way using Bayes’s theorem as:
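Pa(g | d) = P(d | g) × P(g) / P(d), with P(d | g) = P(w1 | g) × P(w2 | g) × … × P(wn | g)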
The probabilities are estimated by counting events over a corpus. For Pb, a normal distribution model with a specific mean and standard deviation is proposed. The combined model of content and sentence length achieves an accuracy of 75% on a blind test set, whereas the Flesch-Kincaid readability score predicts only 21% of the grades correctly.
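One plausible instantiation of the sentence-length model, assuming per-grade parameters μg and σg estimated from the training corpus (notation not taken from the original paper), is:

Pb(g | d) ∝ P(g) × N(l1; μg, σg) × … × N(lk; μg, σg)

where N(·; μ, σ) denotes the normal density.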
2.3.2 READABILITY AS CLASSIFICATION
Schwarm and Ostendorf [2005] see readability assessment as classification and propose the use of SVM algorithms for predicting the readability level of a text based on a set of textual features. In order to train a readability model, they rely on several sources: (i) documents collected from the Weekly Reader educational newspaper with 2nd–5th grade levels; (ii) documents from the Encyclopedia Britannica dataset compiled by Barzilay and Elhadad [2003], containing original encyclopedic articles (115) and their corresponding children’s versions (115); and (iii) CNN news stories (111) from the LiteracyNet organization, available in original and abridged (or simplified) versions. They borrow the idea of Si and Callan [2001], devising features based on statistical language modeling. More concretely, given a corpus of documents of, say, grade k, they create a language model for that grade. Taking 3-gram sequences as units