Predicting reading difficulty with statistical language models
- 1 September 2005
- journal article
- research article
- Published by Wiley in Journal of the American Society for Information Science and Technology
- Vol. 56 (13), 1448–1462
- https://doi.org/10.1002/asi.20243
Abstract
A potentially useful feature of information retrieval systems for students is the ability to identify documents that not only are relevant to the query but also match the student's reading level. Manually obtaining an estimate of reading difficulty for each document is not feasible for very large collections, so we require an automated technique. Traditional readability measures, such as the widely used Flesch‐Kincaid measure, are simple to apply but perform poorly on Web pages and other nontraditional documents. This work focuses on building a broadly applicable statistical model of text for different reading levels that works for a wide range of documents. To do this, we recast the well‐studied problem of readability in terms of text categorization and use straightforward techniques from statistical language modeling. We show that with a modified form of text categorization, it is possible to build generally applicable classifiers with relatively little training data. We apply this method to the problem of classifying Web pages according to their reading difficulty level and show that by using a mixture model to interpolate evidence of a word's frequency across grades, it is possible to build a classifier that achieves an average root mean squared error of between one and two grade levels for 9 of 12 grades. Such classifiers have very efficient implementations and can be applied in many different scenarios. The models can be varied to focus on smaller or larger grade ranges or easily retrained for a variety of tasks or populations.
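The approach the abstract describes — one statistical language model per grade level, with a document assigned to the grade whose model best explains it — can be sketched as follows. This is a minimal illustration only: it uses simple add-one smoothing in place of the paper's mixture interpolation of word frequencies across grades, and the function names and tiny vocabulary are hypothetical.

```python
import math
from collections import Counter

def train_grade_models(corpus_by_grade, vocab):
    """Build one add-one-smoothed unigram model per grade.

    corpus_by_grade: {grade: list of tokenized documents}
    vocab: set of all words seen in training
    """
    models = {}
    V = len(vocab)
    for grade, docs in corpus_by_grade.items():
        counts = Counter()
        for doc in docs:
            counts.update(doc)
        total = sum(counts.values())
        # Add-one (Laplace) smoothing so every vocabulary word
        # has nonzero probability under every grade's model.
        models[grade] = {w: (counts[w] + 1) / (total + V) for w in vocab}
    return models

def predict_grade(tokens, models):
    """Return the grade whose unigram model gives the document
    the highest log-likelihood."""
    best, best_ll = None, float("-inf")
    for grade, probs in models.items():
        unk = min(probs.values())  # fallback for out-of-vocabulary tokens
        ll = sum(math.log(probs.get(w, unk)) for w in tokens)
        if ll > best_ll:
            best, best_ll = grade, ll
    return best

# Toy example with two grade levels:
corpus = {
    1: [["the", "cat", "sat"], ["the", "dog", "ran"]],
    12: [["quantum", "entanglement"], ["thermodynamics", "entropy"]],
}
vocab = {w for docs in corpus.values() for doc in docs for w in doc}
models = train_grade_models(corpus, vocab)
print(predict_grade(["the", "cat", "ran"], models))  # → 1
```

Because the per-grade models are just word-probability tables, classification is a single pass over the document's tokens, which is why the abstract can claim very efficient implementations.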