Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words

  • 15 January 2009
Abstract
Zipf's discovery that word frequency distributions obey a power law established parallels between biological and physical processes, and language, laying the groundwork for a complex systems perspective on human communication. By considering frequent words in USENET discussion groups and in disparate databases where the language has different levels of formality, here we show that the distributions of distances between successive occurrences of the same word display bursty deviations from a Poisson distribution and are well characterized by a stretched exponential scaling. The extent of this deviation depends strongly on semantic type - a measure of the abstractness of each word - and only weakly on frequency. We develop a generative model of this behavior, deriving the stretched exponential distribution of recurrence times as a new empirical scaling law that cannot be anticipated from Zipf's law. Our analysis does not merely describe deviations from a simple bag-of-words model; it accounts for differential deviations relating to the semantics of words and their functions in marking topics of the text or discourse. Because the use of words provides a uniquely precise and powerful lens on human thought and activity, our findings also have implications for other overt manifestations of collective human dynamics.
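
To make the quantities in the abstract concrete, the following minimal Python sketch computes the distances between successive occurrences of a word in a token sequence and compares two candidate survival functions for those distances: the exponential form expected under a Poisson-like bag-of-words model, and a stretched exponential. The function names, the parameterization exp(-(tau/scale)^beta), and the numerical values are illustrative assumptions, not the authors' code or fitted parameters.

```python
import math

def recurrence_distances(tokens, word):
    """Distances (in tokens) between successive occurrences of `word`."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]

def poisson_survival(tau, rate):
    """P(distance > tau) when occurrences are independent (exponential form)."""
    return math.exp(-rate * tau)

def stretched_exp_survival(tau, scale, beta):
    """P(distance > tau) for a stretched exponential (assumed parameterization).

    beta = 1 recovers the exponential case; beta < 1 puts more weight on
    both very short distances (bursts) and very long ones (lulls).
    """
    return math.exp(-((tau / scale) ** beta))

# Toy example: "the" occurs at token positions 0, 3 and 6.
tokens = "the cat saw the dog and the bird".split()
distances = recurrence_distances(tokens, "the")

# Illustrative comparison with hypothetical parameters (scale = 2, beta = 0.5).
for tau in (1, 10):
    p_exp = poisson_survival(tau, rate=0.5)
    p_str = stretched_exp_survival(tau, scale=2.0, beta=0.5)
    print(f"tau={tau}: exponential={p_exp:.4f}, stretched={p_str:.4f}")
```

With these assumed parameters, the stretched exponential assigns a lower survival probability than the exponential at a short distance and a higher one at a long distance, which is the qualitative signature of bursts and lulls. The two curves here share a scale rather than a mean, so the comparison is only illustrative.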
