Predicting reading difficulty with statistical language models
- 1 September 2005
- journal article
- research article
- Published by Wiley in Journal of the American Society for Information Science and Technology
- Vol. 56 (13), 1448–1462
- https://doi.org/10.1002/asi.20243
Abstract
A potentially useful feature of information retrieval systems for students is the ability to identify documents that not only are relevant to the query but also match the student's reading level. Manually obtaining an estimate of reading difficulty for each document is not feasible for very large collections, so we require an automated technique. Traditional readability measures, such as the widely used Flesch‐Kincaid measure, are simple to apply but perform poorly on Web pages and other nontraditional documents. This work focuses on building a broadly applicable statistical model of text for different reading levels that works for a wide range of documents. To do this, we recast the well‐studied problem of readability in terms of text categorization and use straightforward techniques from statistical language modeling. We show that with a modified form of text categorization, it is possible to build generally applicable classifiers with relatively little training data. We apply this method to the problem of classifying Web pages according to their reading difficulty level and show that by using a mixture model to interpolate evidence of a word's frequency across grades, it is possible to build a classifier that achieves an average root mean squared error of between one and two grade levels for 9 of 12 grades. Such classifiers have very efficient implementations and can be applied in many different scenarios. The models can be varied to focus on smaller or larger grade ranges or easily retrained for a variety of tasks or populations.
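The approach the abstract describes — one statistical language model per grade level, with a document assigned to the grade whose model best explains it — can be sketched as follows. This is a minimal illustration only: it uses simple add-one smoothing in place of the paper's mixture interpolation of word frequencies across grades, and the function names and tiny vocabulary are hypothetical.

```python
import math
from collections import Counter

def train_grade_models(corpus_by_grade, vocab):
    """Build one add-one-smoothed unigram model per grade.

    corpus_by_grade: {grade: list of tokenized documents}
    vocab: set of all words seen in training
    """
    models = {}
    V = len(vocab)
    for grade, docs in corpus_by_grade.items():
        counts = Counter()
        for doc in docs:
            counts.update(doc)
        total = sum(counts.values())
        # Add-one (Laplace) smoothing so every vocabulary word
        # has nonzero probability under every grade's model.
        models[grade] = {w: (counts[w] + 1) / (total + V) for w in vocab}
    return models

def predict_grade(tokens, models):
    """Return the grade whose unigram model gives the document
    the highest log-likelihood."""
    best, best_ll = None, float("-inf")
    for grade, probs in models.items():
        unk = min(probs.values())  # fallback for out-of-vocabulary tokens
        ll = sum(math.log(probs.get(w, unk)) for w in tokens)
        if ll > best_ll:
            best, best_ll = grade, ll
    return best

# Toy example with two grade levels:
corpus = {
    1: [["the", "cat", "sat"], ["the", "dog", "ran"]],
    12: [["quantum", "entanglement"], ["thermodynamics", "entropy"]],
}
vocab = {w for docs in corpus.values() for doc in docs for w in doc}
models = train_grade_models(corpus, vocab)
print(predict_grade(["the", "cat", "ran"], models))  # → 1
```

Because the per-grade models are just word-probability tables, classification is a single pass over the document's tokens, which is why the abstract can claim very efficient implementations.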