Good‐Turing frequency estimation without tears
- 1 January 1995
- journal article
- research article
- Published by Taylor & Francis in Journal of Quantitative Linguistics
- Vol. 2 (3) , 217-237
- https://doi.org/10.1080/09296179508590051
Abstract
Linguists and speech researchers who use statistical methods often need to estimate the frequency of some type of item in a population containing items of various types. A common approach is to divide the number of cases observed in a sample by the size of the sample; sometimes small positive quantities are added to divisor and dividend in order to avoid zero estimates for types missing from the sample. These approaches are obvious and simple, but they lack principled justification, and yield estimates that can be wildly inaccurate. I.J. Good and Alan Turing developed a family of theoretically well‐founded techniques appropriate to this domain. Some versions of the Good‐Turing approach are very demanding computationally, but we define a version, the Simple Good‐Turing estimator, which is straightforward to use. Tested on a variety of natural‐language‐related data sets, the Simple Good‐Turing estimator performs well, absolutely and relative both to the approaches just discussed and to other, more sophisticated techniques.
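To make the contrast in the abstract concrete, here is a minimal Python sketch of the two simple approaches it describes (sample proportion, and adding a small constant) next to the basic, unsmoothed Turing estimate. It is not the Simple Good‐Turing estimator defined in the paper. The function names, the toy sample, and the assumed vocabulary size of 6 are illustrative assumptions, not taken from the article. The Turing formulas used are the standard ones: adjusted count r* = (r + 1) N_{r+1} / N_r, and total probability of unseen types N_1 / N, where N_r is the number of types seen exactly r times and N is the sample size.

```python
from collections import Counter


def relative_frequency(counts):
    """Sample proportion: observed count divided by sample size."""
    n = sum(counts.values())
    return {t: c / n for t, c in counts.items()}


def additive(counts, vocab_size, k=1.0):
    """Add a constant k to every count (vocab_size is an assumed input)."""
    n = sum(counts.values())
    return {t: (c + k) / (n + k * vocab_size) for t, c in counts.items()}


def turing_adjusted_counts(counts):
    """Unsmoothed Turing estimate r* = (r + 1) * N_{r+1} / N_r.

    Returns a dict mapping r to r*; r is omitted when N_{r+1} is zero,
    because the raw formula would then give a zero estimate.
    """
    n_r = Counter(counts.values())  # N_r: number of types seen r times
    adjusted = {}
    for r, types_with_r in n_r.items():
        next_n = n_r.get(r + 1, 0)
        if next_n > 0:
            adjusted[r] = (r + 1) * next_n / types_with_r
    return adjusted


def unseen_mass(counts):
    """Estimated total probability of all unseen types: N_1 / N."""
    n = sum(counts.values())
    n_1 = sum(1 for c in counts.values() if c == 1)
    return n_1 / n


# Toy sample (hypothetical): 8 tokens, 5 types.
counts = Counter("a a a b b c d e".split())

print(relative_frequency(counts)["a"])      # 0.375 (3 / 8)
print(additive(counts, vocab_size=6)["a"])  # (3 + 1) / (8 + 6)
print(unseen_mass(counts))                  # 0.375 (N_1 = 3, N = 8)
print(turing_adjusted_counts(counts))       # r = 3 is missing: N_4 = 0
```

On this toy sample the raw Turing estimate adjusts a count of 2 upward to 3.0 and has no usable value for r = 3, which illustrates why the raw counts N_r generally need smoothing before the estimate is usable in practice. The abstract presents the Simple Good‐Turing estimator as a version that is straightforward to use; its details are in the article and are not reproduced here.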