Automatic Text Simplification - Horacio Saggion
CHAPTER 2
Readability and Text Simplification
A key question in text simplification research is the identification of the complexity of a given text so that a decision can be made on whether or not to simplify it. Identifying the complexity of a text or sentence can help assess whether the output produced by a text simplification system matches the reading ability of the target reader. It can also be used to compare different systems in terms of complexity or simplicity of the produced output. There are a number of very complete surveys on the relevant topic of text readability which can be understood as “what makes some texts easier to read than others” [Benjamin, 2012, Collins-Thompson, 2014, DuBay, 2004]. Text readability, which has been investigated for a long time in academic circles, is very close to the “to simplify or not to simplify” question in automatic text simplification. Text readability research has often attempted to devise mechanical methods to assess the reading difficulty of a text so that it can be objectively measured. Classical mechanical text readability formulas combine a number of proxies to obtain a numerical score indicative of the difficulty of a text. These scores could be used to place the texts in an appropriate grade level or used to sort text by difficulty.
2.1 INTRODUCTION
Collins-Thompson [2014], citing Dale and Chall [1948b], defines text readability as the sum of all elements in textual material that affect a reader’s understanding, reading speed, and level of interest in the material. The ability to quantify the readability of a text has long been a topic of research, but current technology and the availability of massive amounts of text in electronic form have changed research in computational readability assessment considerably. Today’s algorithms take advantage of advances in natural language processing, cognition, education, psycholinguistics, and linguistics (“all elements in textual material”) to model a text in such a way that a machine learning algorithm can be trained to compute readability scores for texts. Traditional readability measures were based on the semantic familiarity of words and the syntactic complexity of sentences. Proxies for these elements are, for example, the number of syllables per word or the average number of words per sentence. Most traditional approaches used averages over the set of basic elements (words or sentences) in the text, disregarding order and therefore discourse phenomena. The limitations of early approaches were always clear: words with many syllables are not necessarily complex (e.g., children are probably able to read or understand complex dinosaur names or names of Star Wars characters before more-common words are acquired) and short sentences are not necessarily easy to understand (lines of poetry, for example). Also, traditional formulas were usually designed for texts that were well formatted (not web data) and relatively long. Most methods depend on the availability of graded corpora where documents are annotated with grade levels. The grades can be treated as either categorical or ordinal, giving rise to classification or regression approaches, respectively.
When classification is applied, precision, recall, f-score, and accuracy can be used to measure classification performance and compare different approaches. When regression is applied, Root Mean Squared Error (RMSE) or a correlation coefficient can be used to evaluate the algorithmic performance. In the case of regression, assigning a grade of 4 to a 5th-grade text (1 point difference) is not as serious a mistake as it would be to assign a grade 7 to a 5th-grade text (2 points difference). Collins-Thompson [2014] presents an overview of groups of features which have been accounted for in the readability literature including:
• lexico-semantic (vocabulary) features: relative word frequencies, type/token ratio, probabilistic language model measures such as text probability, perplexity, etc., and word maturity measures;
• psycholinguistic features: word age-of-acquisition, word concreteness, polysemy, etc.;
• syntactic features (designed to model sentence processing time): sentence length, parse tree height, etc.;
• discourse features (designed to model text’s cohesion and coherence): coreference chains, named entities, lexical tightness, etc.; and
• semantic and pragmatic features: use of idioms, cultural references, text type (opinion, satire, etc.), etc.
Collins-Thompson argues that in readability assessment it seems the model used—the features—is more important than the machine learning approach chosen. That is, a well-designed set of features can go a long way in readability assessment.
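The difference between the two evaluation settings can be made concrete with a small sketch. The grades below are toy values invented for illustration, and the helper names `accuracy` and `rmse` are hypothetical; the point is only that accuracy treats a one-grade miss and a two-grade miss identically, whereas RMSE penalizes the larger miss more.

```python
import math

def accuracy(gold, predicted):
    """Fraction of texts whose predicted grade equals the gold grade."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def rmse(gold, predicted):
    """Root mean squared error between gold and predicted grades."""
    squared = [(g - p) ** 2 for g, p in zip(gold, predicted)]
    return math.sqrt(sum(squared) / len(gold))

# Toy data (assumption): four 5th-grade texts.
gold = [5, 5, 5, 5]
near_miss = [4, 5, 5, 5]  # one text assigned grade 4
far_miss = [7, 5, 5, 5]   # one text assigned grade 7

# Accuracy cannot tell the two systems apart; RMSE can.
print(accuracy(gold, near_miss), accuracy(gold, far_miss))
print(rmse(gold, near_miss), rmse(gold, far_miss))
```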
2.2 READABILITY FORMULAS
DuBay [2004] points out that over 200 readability formulas existed by the 1980s. Many of them have been empirically tested to assess their predictive power usually by correlating their outputs with grade levels associated with text sets.
Two of the most widely used readability formulas are the Flesch Reading Ease Score [Flesch, 1949] and the Flesch-Kincaid readability formula [Kincaid et al.]. The Flesch Reading Ease Score uses two text characteristics as proxies: the average sentence length ASL and the average number of syllables per word ASW, which are combined in Formula (2.1):
\[ \mathrm{Score} = 206.835 - (1.015 \times \mathrm{ASL}) - (84.6 \times \mathrm{ASW}) \tag{2.1} \]
On a given text the score will produce a value between 1 and 100 where the higher the value the easier the text would be. Documents scoring 30 are very difficult to read while those scoring 70 should be easy to read.
The Flesch-Kincaid readability formula (2.2) simplifies the Flesch score to produce a “grade level” which is easily interpretable (i.e., a text with a grade level of eight according to the formula could be thought appropriate for an eighth grader):
\[ \mathrm{GL} = (0.39 \times \mathrm{ASL}) + (11.8 \times \mathrm{ASW}) - 15.59 \tag{2.2} \]
Additional formulas used include the FOG readability score [Gunning, 1952] and the SMOG readability score [McLaughlin, 1969]. They are computed using the following equations:
\[ \mathrm{FOG} = 0.4 \times (\mathrm{ASL} + \mathrm{HW}) \tag{2.3} \]
\[ \mathrm{SMOG} = 3 + \sqrt{\mathrm{PSC}} \tag{2.4} \]
where HW is the percentage of “hard” words in the document (a hard word is one with at least three syllables) and PSC is the polysyllable count: the number of words with three or more syllables in 30 sentences, picked from the beginning, middle, and end of the document.
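As a minimal sketch, the four classical formulas can be written as plain functions of the proxies named above (ASL, ASW, HW, PSC). The coefficients are the commonly published ones for each formula and should be read as an assumption here, not as values taken from this chapter; the function names are hypothetical, and counting syllables or sentences is deliberately left out.

```python
import math

def flesch_reading_ease(asl, asw):
    """Flesch Reading Ease from average sentence length and syllables per word."""
    return 206.835 - 1.015 * asl - 84.6 * asw

def flesch_kincaid_grade(asl, asw):
    """Flesch-Kincaid grade level from the same two proxies."""
    return 0.39 * asl + 11.8 * asw - 15.59

def fog(asl, hw):
    """FOG index; hw is the percentage of words with three or more syllables."""
    return 0.4 * (asl + hw)

def smog(psc):
    """SMOG grade; psc is the polysyllable count over a 30-sentence sample."""
    return 3 + math.sqrt(psc)

# Toy statistics (assumption): 10 words per sentence, 1.5 syllables per word,
# 5% hard words, 36 polysyllabic words in the 30 sampled sentences.
print(flesch_reading_ease(10, 1.5))
print(flesch_kincaid_grade(10, 1.5))
print(fog(10, 5))
print(smog(36))
```

Note that the two Flesch-based scores point in opposite directions: a higher Reading Ease score means an easier text, while a higher grade level means a harder one.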
Work on readability assessment has also included the idea of using a vocabulary or word list which may contain words together with indications of age at which the particular words should be known [Dale and Chall, 1948a]. These lists are useful to verify whether a given text deviates from what should be known at a particular age or grade level, constituting a rudimentary form of readability language model.
Readability measures have begun to take center stage in assessing the output of text simplification systems; however, their direct applicability is not without controversy. First, a number of recent studies have applied classical readability formulas to individual sentences [Wubben et al., 2012, Zhu et al., 2010]. Yet many readability formulas were designed around considerable samples of the text being assessed, or need long passages to yield good estimates, so their use at the sentence level still lacks empirical justification and would need to be re-examined. Second, a number of studies suggest the use of readability formulas as a way to guide the simplification process (e.g., De Belder [2014], Woodsend and Lapata [2011]). However, manipulating texts to match a specific readability score may be problematic, since chopping sentences or blindly replacing words could produce totally ungrammatical texts, thereby “cheating” the readability formulas (see, for example, Bruce et al. [1981], Davison et al. [1980]).
2.3 ADVANCED NATURAL LANGUAGE PROCESSING FOR READABILITY ASSESSMENT
Over the last decade, traditional readability assessment formulas have been criticized [Feng et al., 2009]. The advances brought forward in areas of natural language processing made possible a whole new set of studies in the area of readability. Current natural language processing studies in the area of readability assessment rely on automatic parsing, availability of psycholinguistic information, and language modeling techniques [Manning et al., 2008] to develop more robust methods. Today it is possible to extract rich syntactic and semantic features from text in order to analyze and understand how they interact to make the text more or less readable.
2.3.1 LANGUAGE MODELS
Various works have considered corpus-based statistical methods for readability assessment. Si and Callan [2001] cast text readability assessment as a text classification or categorization problem where the classes could be grades or text difficulty levels. Instead of considering just surface linguistic features, they argue, quite naturally, that the content of the document is a key factor contributing to its readability. After observing that some surface features such as syllable count were not useful predictors of grade level in the dataset adopted (syllabi of elementary and middle school science courses of various readability levels from the Web), they combined a unigram language model with a sentence-length language model in the following approach:
\[ P(g \mid d) = \lambda \, P_a(g \mid d) + (1 - \lambda) \, P_b(g \mid d) \]
where $g$ is a grade level, $d$ is the document, $P_a$ is a unigram language model, $P_b$ is a sentence-length distribution model, and $\lambda$ is a coefficient adjusted to yield optimal performance. Note that probability parameters in $P_a$ are words, that is, the document should be seen as $d = w_1 \ldots w_n$ with $w_l$ the word at position $l$ in the document, while in probability $P_b$ the parameters are sentence lengths, so a document with $k$ sentences should be thought of as $d = l_1 \ldots l_k$ with $l_i$ the length of the $i$-th sentence. The $P_a$ probability distribution is a unigram model computed in the usual way using Bayes’s theorem as:
\[ P_a(g \mid d) = \frac{P(d \mid g)\, P(g)}{P(d)}, \qquad P(d \mid g) = \prod_{l=1}^{n} P(w_l \mid g) \]
The probabilities are estimates obtained by counting events over a corpus. Where $P_b$ is concerned, a normal distribution model with a grade-specific mean and standard deviation is proposed. The combined model of content and sentence length achieves an accuracy of 75% on a blind test set, while the Flesch-Kincaid readability score predicts only 21% of the grades correctly.
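The combination of a content model and a sentence-length model can be illustrated with a short sketch. This is not Si and Callan's implementation: the linear interpolation with a weight `lam`, the add-one smoothing, the uniform prior over grades, the toy corpora, and all function names are assumptions made for the example.

```python
import math
from collections import Counter

def posterior_from_logs(log_likelihoods):
    """Normalize per-grade log-likelihoods into P(g|d), assuming a uniform prior."""
    top = max(log_likelihoods.values())
    unnormalized = {g: math.exp(s - top) for g, s in log_likelihoods.items()}
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}

def unigram_posterior(words, counts_by_grade, vocab_size):
    """Content model: per-grade unigram probabilities with add-one smoothing."""
    logs = {}
    for grade, counts in counts_by_grade.items():
        total = sum(counts.values())
        logs[grade] = sum(
            math.log((counts[w] + 1) / (total + vocab_size)) for w in words
        )
    return posterior_from_logs(logs)

def length_posterior(sentence_lengths, gaussians_by_grade):
    """Length model: each grade has a normal distribution over sentence length."""
    logs = {}
    for grade, (mean, std) in gaussians_by_grade.items():
        logs[grade] = sum(
            -0.5 * ((n - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))
            for n in sentence_lengths
        )
    return posterior_from_logs(logs)

def combined_posterior(p_a, p_b, lam):
    """Linear interpolation of the two per-grade posteriors."""
    return {g: lam * p_a[g] + (1 - lam) * p_b[g] for g in p_a}

# Toy training data (assumption): one tiny corpus per grade.
counts_by_grade = {
    2: Counter("the cat sat on the mat the dog ran".split()),
    8: Counter("the cell membrane regulates osmosis and diffusion".split()),
}
vocab = set().union(*counts_by_grade.values())
gaussians_by_grade = {2: (6.0, 2.0), 8: (18.0, 5.0)}  # (mean, std) of sentence length

p_a = unigram_posterior("the cell membrane".split(), counts_by_grade, len(vocab))
p_b = length_posterior([17, 20], gaussians_by_grade)
combined = combined_posterior(p_a, p_b, lam=0.5)
print(max(combined, key=combined.get))  # predicted grade for the toy document
```

In practice `lam` would be tuned on held-out data, which is how the text describes the coefficient being adjusted.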
2.3.2 READABILITY AS CLASSIFICATION
Schwarm and Ostendorf [2005] see readability assessment as classification and propose the use of SVM algorithms for predicting the readability level of a text based on a set of textual features. In order to train a readability model, they rely on several sources: (i) documents collected from the Weekly Reader educational newspaper with 2nd–5th grade levels; (ii) documents from the Encyclopedia Britannica dataset compiled by Barzilay and Elhadad [2003] containing original encyclopedic articles (115) and their corresponding children’s versions (115); and (iii) CNN news stories (111) from the LiteracyNet organization available in original and abridged (or simplified) versions. They borrow the idea of Si and Callan [2001], thus devising features based on statistical language modeling. More concretely, given a corpus of documents with, say, grade k, they create a language model for that grade. Taking 3-gram sequences as units for modeling the text, the probability $p(w)$ of a word sequence $w = w_1 \ldots w_n$ in the k-grade corpus is computed as:
\[ p(w) = p(w_1)\, p(w_2 \mid w_1) \prod_{i=3}^{n} p(w_i \mid w_{i-1}, w_{i-2}) \]
where the 3-gram probabilities are estimated using 3-gram frequencies observed in the k-grade documents and smoothing techniques to account for unobserved events. Given the probabilities $p(w \mid k)$ of a sequence $w$ in the different models (one per grade), a likelihood ratio of sequence $w$ is defined as:
\[ LR(w, k) = \frac{p(w \mid k)\, p(k)}{\sum_{j \neq k} p(w \mid j)\, p(j)} \]
where the prior p(k) probabilities can be assumed to be uniform. The LR(w, k) values already give some information on the likelihood of the text being of a certain complexity or grade. Additionally, the authors use perplexity as an indicator of the fit of a particular text to a given model where low perplexity for a text t and model m would indicate a better fit of t to m. Worth noting is the reduction of the features of the language models based on feature filtering by information gain (IG) values to 276 words (the most discriminative) and 56 part-of-speech tags (for words not selected by IG). SVMs are trained using the graded dataset (Weekly Reader), where each text is represented as a set of features including traditional readability assessment superficial features such as average sentence length, average number of syllables per word, and the Flesch-Kincaid index together with more-sophisticated features such as syntax-based features, vocabulary features, and language model features. Syntax-based features are extracted from parsed sentences [Charniak, 2000] and include average parse tree height, average number of noun phrases, average number of verb phrases, and average number of clauses (SBARs in the Penn Treebank tag set). Vocabulary features account for out-of-vocabulary (OOV) word occurrences in the text. These are computed as percentages of words or word types not found in the most common 100, 200, and 500 words occurring in 2nd-grade texts. Concerning language model features, there are 12 perplexity values for 12 different language models computed using 12 different combinations of the paired datasets Britannica/CNN (adults vs. children) and three different n-grams: unigrams, bigrams, and trigrams (combining discriminative words and POS tags).
The authors obtained better results in comparison to traditional readability formulas when their language model features are used in combination with vocabulary features, syntax-based features, and superficial indicators. Petersen and Ostendorf [2007] extend the previous work by considering additional non-graded data from newspaper articles to represent higher grade levels (more useful for classification than for regression).
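The two language-model signals mentioned in this section, a per-grade likelihood ratio and perplexity, can be sketched as follows. The sketch assumes the ratio divides the prior-weighted probability under the grade-k model by the sum of the prior-weighted probabilities under all other grade models, and that perplexity is the exponential of the negative average log-probability per word; the function names and toy probabilities are hypothetical.

```python
import math

def likelihood_ratio(log_probs, k, priors=None):
    """LR for grade k from per-grade log-probabilities of the same word sequence.

    log_probs maps each grade to log p(w | grade); priors default to uniform.
    """
    grades = list(log_probs)
    if priors is None:
        priors = {g: 1.0 / len(grades) for g in grades}
    # Shift by the maximum before exponentiating to limit underflow.
    top = max(log_probs.values())
    weighted = {g: math.exp(log_probs[g] - top) * priors[g] for g in grades}
    return weighted[k] / sum(v for g, v in weighted.items() if g != k)

def perplexity(log_prob, n_words):
    """Perplexity of a text of n_words given its total natural-log probability."""
    return math.exp(-log_prob / n_words)

# Toy values (assumption): one sequence scored by three grade-specific models.
log_probs = {2: math.log(0.2), 3: math.log(0.1), 4: math.log(0.1)}
print(likelihood_ratio(log_probs, 2))
print(perplexity(4 * math.log(0.25), 4))
```

A lower perplexity indicates a better fit of the text to the model, which is how the feature is used as an indicator in the classifier.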
2.3.3 DISCOURSE, SEMANTICS, AND COHESION IN ASSESSING READABILITY
Feng et al. [2009] are specially interested in readability for individuals with mild-level intellectual disabilities (MID) (e.g., intelligence quotient (IQ) in the 55–70 range) and how to select appropriate reading material for this population. The authors note that people with MID are different from adults with low literacy in that the former have problems with working memory and with discourse representation, thereby complicating the processes of recalling information and inference as they read a text. The authors argue that appropriate readability assessment tools which take into account the specific issues of these users should therefore be designed. Their main research hypothesis being that the number of entity mentions in a text should be related to readability issues for people with MID, they design a series of features accounting for entity density. Where data for studying this specific population is concerned, they have created a small (20 documents in original and simplified versions) but rather unique ID dataset for testing their readability prediction model. The dataset is composed of news documents with aggregated readability scores based on the number of correct answers to multiple choice questions that 14 MID individuals had given after reading the texts. In order to train a model, they rely on the availability of paired and generic graded corpora. The paired dataset (not graded) is composed of original articles from Encyclopedia Britannica written for adults and their adapted versions for children and CNN news stories from the LiteracyNet organization available in original and abridged (or simplified) versions. The graded dataset is composed of articles for students in grades 2–5. 
Where the model’s features are concerned, although many features studied were already available (or similar) in previous work, novel features take into account the number and the density of entity mentions (i.e., nouns and named entities), the number of lexical chains in the text, average lexical chain length, etc. These features are assessed on the paired datasets so as to identify their discriminative power, leaving all but two features outside the model. Three rich readability prediction models (corresponding to basic, cognitively motivated, and union of all features) are then trained on the graded dataset (80% of the dataset) using a linear regression algorithm (unlike the above approach). Evaluation is carried out on 20% of the dataset, showing considerable error reduction (difference between predicted and gold grade) of the models when compared with a baseline readability formula (the Flesch-Kincaid index [Kincaid et al.]). The final user-specific evaluation is conducted on the ID corpus where the model is evaluated by computing the correlation between system output and human readability scores associated with texts.
Feng et al. [2010] extended the previous work by incorporating additional features (e.g., language model features and out-of-vocabulary features from Schwarm and Ostendorf [2005] and entity coreference and coherence-based features based on those of Barzilay and Lapata [2008] and Pitler and Nenkova [2008]), assessing performance of each group of features, and comparing their model to state-of-the-art competing approaches (i.e., mainly replicating the models of Schwarm and Ostendorf [2005]). Experimental results using SVMs and logistic regression classifiers show that although accuracy is still limited (around 74% with SVMs and selected features) important gains are obtained from the use of more sophisticated linguistically motivated features.
Heilman et al. [2007] are interested in the effect of pedagogically motivated features in the development of readability assessment tools, especially in the case of texts for second language (L2) learners. More specifically, they suggest that since L2 learners acquire lexicon and grammar of the target language from exposure to material specifically chosen for the acquisition process, both lexicon and grammar should play a role in assessing the reading difficulty of the L2 learning material. In terms of lexicon, a unigram language model is proposed for each grade level so as to assess the likelihood of a given text to a given grade (see Section 2.3.1 for a similar approach). Where syntactic information is concerned, two different sets of features are proposed: (i) a set of 22 grammatical constructions (e.g., passive voice, relative clause) identified in sentences after being parsed by the Stanford Parser [Klein and Manning, 2003], which produces syntactic constituent structures; and (ii) 12 grammatical features (e.g., sentence length, verb tenses, part of speech tags) which can be identified without the need of a syntactic parser. All feature values are numerical, indicating the number of times the particular feature occurred per word in the text (note that other works take averages on a per-sentence basis). Texts represented as vectors of features and values are used in a k-Nearest Neighbor (kNN) algorithm (see Mitchell [1997]) to predict the readability grade of unseen texts: a given text t is compared (using a similarity measure) to all available vectors and the k-closest texts retrieved, the grade level of t is then the most frequent grade among the k retrieved texts. While the lexical model above will produce, for each text and grade, a probability, the confidence of the kNN prediction can be computed as the proportion of the k texts with same class as text t. 
The probability of the language model and the kNN confidence can be interpolated to yield a confidence score for a joint grade prediction model. In order to evaluate different individual models and combinations, the authors use one dataset for L1 learners (a web corpus [Collins-Thompson and Callan, 2004]) and a second dataset for L2 learners (collected from several sources). Prediction performance is measured using correlation and MSE, since the authors argue regression is a more appropriate way to see readability assessment. Overall, although the lexical model in isolation is superior to the two grammatical models (in both datasets), their combination shows significant advantages. Moreover, although the complex syntactic features have better predictive power than the simple syntactic features, the difference in performance is slight, which may justify not using a parser.
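The kNN step and its confidence can be illustrated with a minimal sketch. Euclidean distance stands in for the unspecified similarity measure, the two per-word feature rates and the grades are toy values, and `joint_score` with its `weight` parameter is a hypothetical way of interpolating the two signals; none of this is taken from Heilman et al.'s implementation.

```python
import math
from collections import Counter

def knn_grade(vector, training, k):
    """Predict a grade as the majority grade among the k closest training texts.

    training is a list of (feature_vector, grade) pairs. Returns the predicted
    grade and the confidence, i.e. the share of the k neighbors with that grade.
    """
    nearest = sorted(training, key=lambda item: math.dist(vector, item[0]))[:k]
    votes = Counter(grade for _, grade in nearest)
    grade, count = votes.most_common(1)[0]
    return grade, count / k

def joint_score(lm_probability, knn_confidence, weight):
    """Interpolate the language-model probability with the kNN confidence."""
    return weight * lm_probability + (1 - weight) * knn_confidence

# Toy feature vectors (assumption): occurrences per word of two constructions,
# e.g. passive voice and relative clauses.
training = [
    ((0.01, 0.00), 2),
    ((0.02, 0.01), 2),
    ((0.08, 0.05), 6),
    ((0.09, 0.06), 6),
    ((0.10, 0.04), 6),
]
grade, confidence = knn_grade((0.085, 0.05), training, k=3)
print(grade, confidence)
print(joint_score(0.6, confidence, weight=0.5))
```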
Although these works are interesting because they consider a different user population, they still lack an analysis of the effect that different automatic tools have in readability assessment performance: since parsers, coreference resolution systems, and lexical chainers are imperfect, an important question to be asked is how changes in performance affect the model outcome.
Crossley et al. [2007] investigate three Coh-Metrix variables [Graesser et al., 2004] for assessing the readability of texts from the Bormuth corpus, a dataset where scores are given to texts based on aggregated answers from informants using cloze tests. Three variables were used in a multiple regression analysis: the number of words per sentence as an estimate of syntactic complexity, argument overlap (the number of sentences sharing an argument, such as nouns, pronouns, or noun phrases), and word frequencies from the CELEX database [Celex, 1993]. Correlation between the variables used and the text scores was very high.
Flor and Klebanov [2014] carried out one of the few studies (see also Feng et al. [2009]) to assess lexical cohesion [Halliday and Hasan, 1976] for text readability assessment. Since cohesion is related to the way in which elements in the text are tied together to allow text understanding, a more cohesive text may well be perceived as more readable than a less cohesive text. Flor and Klebanov define lexical tightness, a metric based on a normalized form (NPMI) of the pointwise mutual information of Church and Hanks [1990] that measures the strength of associations between words in a given document based on co-occurrence statistics compiled from a large corpus. The lexical tightness of a text is the average of the NPMI values over all pairs of content words in the text. It is shown that lexical tightness correlates well with grade levels: simple texts tend to be more lexically cohesive than difficult ones.
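A rough sketch of the lexical tightness idea follows. It assumes the usual definition of normalized PMI (PMI divided by the negative log of the joint probability) and averages it over all unordered pairs of content words; the probabilities are invented toy values rather than corpus statistics, the treatment of unseen pairs is an assumption, and the function names are hypothetical.

```python
import math
from itertools import combinations

def npmi(p_a, p_b, p_ab):
    """Normalized pointwise mutual information, in the range [-1, 1]."""
    if p_ab <= 0:
        return -1.0  # assumption: pairs never seen together get the lower bound
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

def lexical_tightness(content_words, unigram_p, pair_p):
    """Average NPMI over all unordered pairs of content words in a text."""
    pairs = list(combinations(content_words, 2))
    total = sum(
        npmi(unigram_p[a], unigram_p[b], pair_p.get(frozenset((a, b)), 0.0))
        for a, b in pairs
    )
    return total / len(pairs)

# Toy co-occurrence statistics (assumption), standing in for a large corpus.
unigram_p = {"dog": 0.01, "bark": 0.005, "tax": 0.01}
pair_p = {
    frozenset(("dog", "bark")): 0.002,
    frozenset(("dog", "tax")): 0.0001,
    frozenset(("bark", "tax")): 0.00005,
}

print(lexical_tightness(["dog", "bark"], unigram_p, pair_p))  # strongly associated
print(lexical_tightness(["dog", "tax"], unigram_p, pair_p))   # roughly independent
```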
2.4 READABILITY ON THE WEB
There is increasing interest in assessing document readability in the context of web search engines and in particular for personalization of web search results: search results that, in addition to matching the user’s query, are ranked according to their readability (e.g., from easier to more difficult). One approach is to display search results along with readability levels (Google Search offered in the past the possibility of filtering search results by reading level) so that users could select material based on its reading level assessment; however, this is limited in that the profile or expertise of the reader (i.e., search behavior) is not taken into consideration when presenting the results. Collins-Thompson et al. [2011] introduced a tripartite approach to personalization of search results by reading level (appropriate documents for the user’s readability level should be ranked higher) which takes advantage of user profiles (to assess their readability level), document difficulty, and a re-ranking strategy so that documents more appropriate for the reader would move to the top of the search result list. They use a language-model readability assessment method which leverages word difficulty computed from a web corpus in which pages have been assigned grade levels by their authors [Collins-Thompson and Callan, 2004]. The method departs from traditional readability formulas in that it is based on a probabilistic estimation that models individual word complexity as a distribution across grade levels. Text readability is then based on distribution of those words occurring in the document. 
The authors argue that traditional formulas which consider morphological word complexity and sentence complexity (e.g., length) features and that sometimes require word-passages of certain sizes (i.e., at least 100 words) to yield an accurate readability estimate appear inappropriate in a web context where sentence boundaries are sometimes nonexistent and pages can have very little textual content (e.g., images and captions). To estimate the reading proficiency of users and also to train some of the model parameters and evaluate their approach, they rely on the availability of proprietary data on user-interaction behaviors with a web search engine (containing queries, search results, and relevance assessment). With this dataset at hand, the authors can compute a distribution of the probability that a reader likes the readability level of a given web page from web pages that the user visited and read. A re-ranking algorithm, LambdaMART [Wu et al., 2010], is then used to improve the search results and bring results more appropriate to the user to the top of the search result list. The algorithm is trained using reading level for pages and snippets (i.e., search results summaries), user reading level, query characteristics (e.g., length), reading level interactions (e.g., snippet-page, query-page), and confidence values for many of the computed features. Re-ranking experiments across a variety of query-types indicate that search results improve at least one rank for all queries (i.e., the appropriate URL was ranked higher than with the default search engine ranking algorithm).
Related to work on web documents readability is the question of how different ways in which web pages are parsed (i.e., extracting the text of the document and identifying sentence boundaries) influence the outcome of traditional readability measures. Palotti et al. [2015] study different tools for extracting and sentence-splitting textual content from pages and different traditional readability formulas. They found that web search results ranking varies considerably depending on the readability formulas and text processing methods used, and also that some text processing methods would produce document rankings with marginal correlation when a given formula is used.
2.5 ARE CLASSIC READABILITY FORMULAS CORRELATED?
Given the proliferation of readability formulas, one may wonder how they differ and which one should be used for assessing the difficulty of a given text. Štajner et al. [2012] study the correlation of a number of classic readability formulas and linguistically motivated features using different corpora to identify which formula or linguistic characteristics may be used to select appropriate text for people with an autism-spectrum disorder.
The corpora included in the study were: 170 texts from Simple Wikipedia, 171 texts from a collection of news texts from the METER corpus, 91 texts from the health section of the British National Corpus, and 120 fiction texts from the FLOB corpus. The readability formulas studied were the Flesch Reading Ease score, the Flesch-Kincaid grade level, the SMOG grading, and the FOG index. According to the authors, the linguistically motivated features were designed to detect possible “linguistic obstacles” in a text that may hinder readability. They include features of structural complexity such as the average number of major POS tags per sentence, average number of infinitive markers, coordinating and subordinating conjunctions, and prepositions. Features indicative of ambiguity include the average number of senses per word and the average number of pronouns and definite descriptions per sentence. The authors first computed over each corpus averages of each readability score to identify which corpora were “easier” according to the formulas. To their surprise and according to all four formulas, the corpus of fiction texts appears to be the easiest to read, with health-related documents at the same readability level as Simple Wikipedia articles. In another experiment, they study the correlation of each pair of formulas in each corpus; their results indicate almost perfect correlation, suggesting the formulas could be interchangeable. Their last experiment, which studies the correlation between the Flesch-Kincaid formula and the different linguistically motivated features, indicates that although most features are strongly correlated with the readability formula, the strength of the correlation varies from corpus to corpus. The authors suggest that because of the correlation of the readability formula with linguistic indicators of reading difficulty, the Flesch score could be used to assess the difficulty level of texts for their target audience.
2.6 SENTENCE-LEVEL READABILITY ASSESSMENT
Most readability studies consider the text as the unit for assessment (although Collins-Thompson et al. [2011] present a study also for text snippets and search queries); however, some authors have recently become interested in assessing readability of short units such as sentences. Dell’Orletta et al. [2014a,b], in addition to presenting a readability study for Italian where they test the value of different features for classification of texts into easy or difficult, also address the problem of classifying sentences as easy-to-read or difficult-to-read. The problem they face is the unavailability of annotated corpora for the task, so they rely on documents from two different providers: easy-to-read documents are sampled from the easy-to-read newspaper Due Parole while the difficult-to-read documents are sampled from the newspaper La Repubblica. Features for document classification included in their study are: raw text features such as sentence-length and word-length averages, lexical features such as type/token ratio (i.e., lexical variety) and percentage of words on different Italian word reference lists, etc., morpho-syntactic features such as probability distributions of POS tags in the text, ratio of the number of content words (nouns, verbs, adjectives, adverbs) to number of words in the text, etc., and syntactic features such as average depth of syntactic parse trees, etc. For sentence readability classification (easy-to-read vs. difficult-to-read), they prepared four different datasets based on the document classification task. Sentences from Due Parole are considered easy-to-read; however, assuming that all sentences from La Repubblica are difficult would in principle be an incorrect assumption.
Therefore, they create four different sentence classification datasets for training models and assess the need for manually annotated data: the first set (s1) is a balanced dataset of easy-to-read and difficult-to-read sentences (1310 sentences of each class); the second dataset (s2) is an unbalanced dataset of easy-to-read (3910 sentences) and assumed difficult-to-read sentences (8452); the third dataset (s3) is a balanced dataset with easy-to-read (3910 sentences) and assumed difficult-to-read sentences (3910); and, finally, the fourth dataset (s4) also contains easy-to-read sentences (1310) and assumed difficult-to-read sentences (1310). They perform classification experiments with maximum entropy models to discriminate between easy-to-read and difficult-to-read sentences, using held-out manually annotated data. They noted that although using the gold-standard dataset (s1) provides the best results in terms of accuracy, using a balanced dataset of “assumed” difficult-to-read sentences (i.e., s3) for training is close behind, suggesting that the effort of manually filtering out difficult sentences to create a dataset must be weighed against the small gain in accuracy. They additionally study feature contribution to sentence readability and document readability, noting that local features based on syntax are more relevant for sentence classification while global features such as average sentence and word lengths or type/token ratio are more important for document readability assessment.
Vajjala and Meurers [2014] investigate the issue of readability assessment for English, also focusing on the readability of sentences. Their approach is based on training two different regression algorithms on WeeBit, a corpus of 625 graded documents for age groups 7 to 16 years that they have specifically assembled, which contains articles from the Weekly Reader (see above) and articles from the BBC Bitesize website. As in previous work, the model contains a number of different groups of features accounting for lexical and POS tag distribution information, superficial characteristics (e.g., word length) and classical readability indices (e.g., Flesch-Kincaid), age-of-acquisition word information, word ambiguity, etc. A 10-fold cross-validation evaluation, using correlation and mean error rate metrics, is carried out, as is validation on available standard datasets from the Common Core Standards corpus (168 documents), the TASA corpus (see Vajjala and Meurers [2014] for details) (37K documents), and the Math Readability corpus (120 web pages). The model achieves high correlation in cross-validation and reasonable correlation across datasets, except in the Math corpus, probably because of the rating scale used. The approach also compares very favorably with respect to several proprietary systems. Where sentence readability is concerned, the model trained on the WeeBit corpus is applied to sentences from the OneStopEnglish corpus, a dataset in which original documents (30 articles, advanced level) have been edited to obtain documents at intermediate and beginner reading levels. Experiments are first undertaken to assess whether the model is able to separate the three different types of documents, and then to evaluate a sentence readability model.
To evaluate sentence readability, each pair of parallel documents (advanced-intermediate, intermediate-beginner, advanced-beginner) is manually sentence-aligned and experiments are carried out to test whether the model is able to preserve the relative readability order of the aligned sentences (e.g., advanced-level sentence less readable than beginner-level sentence). Overall, the model preserves the readability order in 60% of the cases.
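The order-preservation test described above can be sketched as follows. This is a minimal illustration under stated assumptions: `order_preservation_rate` and `toy_score` are hypothetical names, and the toy scorer (sentence length in words) merely stands in for the trained regression model, whose actual output is not reproduced here.

```python
def order_preservation_rate(pairs, score):
    """Fraction of aligned (harder, easier) sentence pairs for which the
    scorer assigns a strictly higher difficulty to the harder sentence."""
    if not pairs:
        raise ValueError("need at least one aligned sentence pair")
    preserved = sum(1 for harder, easier in pairs if score(harder) > score(easier))
    return preserved / len(pairs)


def toy_score(sentence):
    """Hypothetical stand-in for a trained model: more words means harder."""
    return len(sentence.split())


# Each pair is (sentence from the more advanced version, its aligned simpler version).
aligned_pairs = [
    ("The committee postponed the decision indefinitely.", "The group delayed the choice."),
    ("He left.", "He went away from the house."),
    ("Prices rose sharply last year.", "Prices went up."),
]
rate = order_preservation_rate(aligned_pairs, toy_score)
print(f"order preserved in {rate:.0%} of pairs")
```

Ties are counted as failures in this sketch; whether the original evaluation treated ties in the same way is not stated in the text.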
2.7 READABILITY AND AUTISM
Yaneva et al. [2016a,b] study text and Web accessibility for people with ASD. They developed a small corpus composed of 27 documents evaluated by 27 people diagnosed with an ASD. The novelty of the corpus is that, in addition to induced readability levels, it also contains gaze data obtained from eye-tracking experiments in which ASD subjects (and a control group of non-ASD subjects) were measured reading the texts, after which they were asked multiple-choice text-comprehension questions. The overall difficulty of the texts was obtained from quantitative data relating to answers given to those comprehension questions. For each text, correct and incorrect answers were counted, and the texts were ranked by number of correct answers. The ranking provided a way to separate texts into three difficulty levels: easy, medium, and difficult. The corpus itself was not used to develop a readability model; instead, it was used as test data for a readability model trained on the WeeBit corpus (see previous section), which was transformed into a 3-way labeled dataset (only 3 difficulty levels were extracted from WeeBit to comply with the ASD corpus).
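The ranking-based induction of difficulty levels can be sketched as follows. The text does not state where the cut points between levels were placed, so splitting the ranking into three equal parts is an assumption made here for illustration, as are the function name `difficulty_levels` and the invented counts in the example.

```python
def difficulty_levels(correct_counts):
    """Map each text id to 'easy', 'medium', or 'difficult' according to its
    rank by number of correctly answered comprehension questions."""
    # Texts with the most correct answers come first and are treated as easiest.
    ranked = sorted(correct_counts, key=lambda text_id: correct_counts[text_id], reverse=True)
    n = len(ranked)
    labels = {}
    for rank, text_id in enumerate(ranked):
        # Assumption: equal thirds of the ranking define the three levels.
        if rank < n / 3:
            labels[text_id] = "easy"
        elif rank < 2 * n / 3:
            labels[text_id] = "medium"
        else:
            labels[text_id] = "difficult"
    return labels


# Invented counts of correct answers per text, for illustration only.
counts = {"t1": 20, "t2": 5, "t3": 12, "t4": 25, "t5": 9, "t6": 15}
print(difficulty_levels(counts))
```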
Yaneva et al. grouped sets of features according to the different types of phenomena they account for: (i) lexico-semantic information such as word characteristics (length, syllables, etc.), numerical expressions, passive verbs, etc.; (ii) superficial syntactic information such as sentence length or punctuation information; (iii) cohesion information such as occurrence of pronouns and definite descriptions, etc.; (iv) cognitively motivated information including word frequency, age of acquisition of words, word imageability, etc.; and (v) information arising from several readability indices such as the Flesch-Kincaid Grade Level and the FOG readability index, etc. Two decision-tree algorithms, random forest [Breiman, 2001] and reduced error pruning tree (see [Hall et al., 2009]), were trained on the WeeBit corpus (see previous section), cross-validated on WeeBit, and tested on the ASD corpus. Feature optimization was carried out using a best-first feature selection strategy which identified such features as polysemy, FOG index, incidence of pronouns, sentence length, age of acquisition, etc. The feature selection procedure yields a model with improved performance on training and test data; nonetheless, results of the test on the ASD corpus are not optimal when compared with the cross-validation results on WeeBit. Worth noting is that although some of the features selected might be representative of problems ASD subjects may encounter when reading text, these features emerged from a corpus (WeeBit) that is not ASD-specific, suggesting that the selected features model general text difficulty assessment.
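Two of the readability indices named in feature group (v) can be computed from simple counts, as the following sketch shows. The formulas are the commonly published forms of the Flesch-Kincaid Grade Level and the Gunning FOG index; whether Yaneva et al. used exactly these variants is not stated in the text. The function names and the example counts are assumptions, and the counts themselves (words, sentences, syllables, and complex words of three or more syllables) are taken as given rather than extracted from text.

```python
def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level from word, sentence, and syllable counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def gunning_fog(n_words, n_sentences, n_complex_words):
    """Gunning FOG index; complex words are those with three or more syllables."""
    return 0.4 * ((n_words / n_sentences) + 100.0 * (n_complex_words / n_words))


# Invented counts for a 100-word passage, for illustration only.
print(round(flesch_kincaid_grade(100, 5, 150), 2))
print(round(gunning_fog(100, 5, 10), 2))
```

Both indices combine an average sentence length term with a word-complexity term, which is why they overlap with the superficial features in groups (i) and (ii).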
Based on the ASD corpus, a sentence readability assessment dataset composed of 257 sentences was prepared. Sentences were classified into easy-to-read and difficult based on the eye-tracker data associated with the texts. Sentences were ranked based on the average number of fixations they had during the readability assessment experiments, and the set of sentences was split in two parts to yield the two sentence readability classes. To complement the sentences from ASD and to control for length, short sentences from publicly available sources [Laufer and Nation, 1999] were added to the dataset. The labels for these sentences were obtained through a comprehension questionnaire which subjects with ASD had to answer. Sentences were considered easy to read if at least 60% of the subjects correctly answered the comprehension question associated with the sentence. Binary classification experiments on this dataset were performed using the Pegasos algorithm [Shalev-Shwartz et al., 2007] with features to model superficial sentence characteristics (number of words, word length, etc.), cohesion (proportion of connectives, causal expressions, etc.), cognitive indicators (word concreteness, imageability, polysemy, etc.), and several incidence counts (negation, pronouns, etc.). A cross-validation experiment achieved 0.82 F-score using a best-first feature selection strategy.
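The two labeling procedures described above can be sketched as follows. The text says only that the fixation ranking was split in two parts, so the midpoint split and the reading of fewer fixations as easier are assumptions; the function names and example values are likewise invented for illustration. The 60% threshold is the one given in the text.

```python
def split_by_fixations(avg_fixations):
    """Rank sentences by average fixation count and label the lower half
    'easy' and the upper half 'difficult' (midpoint split is an assumption)."""
    ranked = sorted(avg_fixations, key=lambda sent_id: avg_fixations[sent_id])
    half = len(ranked) // 2
    return {
        sent_id: ("easy" if position < half else "difficult")
        for position, sent_id in enumerate(ranked)
    }


def label_by_comprehension(n_correct, n_subjects, threshold=0.6):
    """Label a sentence 'easy' if at least `threshold` of the subjects
    answered its comprehension question correctly."""
    return "easy" if n_correct / n_subjects >= threshold else "difficult"


# Invented average fixation counts per sentence, for illustration only.
print(split_by_fixations({"s1": 3.0, "s2": 9.5, "s3": 5.0, "s4": 12.0}))
print(label_by_comprehension(17, 27))
```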
2.8 CONCLUSION
Over the years, researchers have tried to come up with models able to predict the difficulty of a given text. Research on readability assessment is important for automatic text simplification in that models of readability assessment can help identify texts or text fragments which would need some sort of adaptation in order to make them accessible for a specific audience. Readability assessment can also help developers in the evaluation of automatic text simplification systems. Although traditional formulas which rely on simple superficial proxies are still used, in recent years the availability of sophisticated natural language processing tools and a better understanding of the text properties that account for text quality, cohesion, and coherence have fueled research in readability assessment, notably in computational linguistics.
This chapter covered several aspects of the readability assessment problem: it reviewed classical readability formulas, presented several advanced computational approaches based on machine learning techniques and sophisticated linguistic features, and pointed out current interest in readability for specific target populations as well as for texts with particular characteristics such as web pages.
2.9 FURTHER READING
Readability formulas and studies have been proposed for many different languages. For Basque, an agglutinative language with rich morphology, Gonzalez-Dios et al. [2014] recently proposed using a number of Basque-specific features to separate documents with two different readability levels, achieving over 90% accuracy (note that this is only binary classification). A readability formula developed for Swedish, the Lix index, which uses word length and sentence length as difficulty factors, has been used in many other languages [Anderson, 1981]. There has been considerable research on readability in Spanish [Anula Rebollo, 2008, Rodríguez Diéguez et al., 1993, Spaulding, 1956] and its application to automatic text simplification evaluation [Štajner and Saggion, 2013a].
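As an illustration of how the Lix index mentioned above combines its two factors, the following sketch uses the commonly cited formulation: the average number of words per sentence plus the percentage of long words, where a long word has more than six letters. The function name `lix`, the punctuation-based sentence splitter, and the `\w+` tokenizer are simplifying assumptions and may differ from the counting rules of the original formula.

```python
import re


def lix(text, long_word_min_length=7):
    """Lix index: words per sentence plus percentage of long words."""
    # Naive sentence split on runs of ., ! or ? (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive tokenization: runs of word characters, Unicode-aware in Python 3.
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    # Long words have more than six letters, i.e., at least seven.
    long_words = [w for w in words if len(w) >= long_word_min_length]
    return len(words) / len(sentences) + 100.0 * len(long_words) / len(words)


print(lix("The cat sat here. Elephants remember many things."))
```

Since the formula relies only on character and sentence counts, and not on syllables, it transfers to other languages without a language-specific syllabifier, which may help explain its use beyond Swedish.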
1 http://www.weeklyreader.com
2 http://literacynet.org/
3 https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
4 http://www.helsinki.fi/varieng/CoRD/corpora/FLOB/
5 http://www.dueparole.it/default_.asp
6 http://www.repubblica.it/
7 http://www.corestandards.org/ELA-Literacy/
8 http://wing.comp.nus.edu.sg/downloads/mwc/
9 http://www.onestopenglish.com/