The Natural Statistics of Audiovisual Speech
Open Access
- 17 July 2009
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 5 (7), e1000436
- https://doi.org/10.1371/journal.pcbi.1000436
Abstract
Humans, like other animals, are exposed to a continuous stream of signals, which are dynamic, multimodal, extended, and time varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2–7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
Author Summary
When we watch someone speak, how much work is our brain actually doing? How much of this work is facilitated by the structure of speech itself? Our work shows that not only are the visual and auditory components of speech tightly locked (obviating the need for the brain to actively bind such information), but this temporal coordination also has a distinct rhythm of between 2 and 7 Hz. Furthermore, during speech production, the onset of the voice occurs with a delay of between 100 and 300 ms relative to the initial, visible movements of the mouth. These temporal parameters of audiovisual speech are intriguing because they match known properties of neuronal oscillations in the auditory cortex. Thus, given what we already know about the neural processing of speech, the natural features of audiovisual speech signals seem to be optimally structured for their interactions with ongoing brain rhythms in receivers.
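The three quantities summarized above (a mouth-area/envelope correlation, a 2–7 Hz modulation band, and a mouth-to-voice lag of 100–300 ms) lend themselves to a compact signal-processing sketch. The Python snippet below is illustrative only: the Hilbert-transform envelope, the 4th-order Butterworth band-pass, the 30 Hz video frame rate, and the synthetic test signals are assumptions made for the example, not the authors' published pipeline.

```python
# Minimal sketch of the kind of analysis described in the abstract, assuming
# (a) an audio waveform and (b) a per-frame mouth-opening area from video.
# All signal names, rates, and filter settings here are illustrative assumptions.
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, resample, correlate

def acoustic_envelope(audio, sr_audio, sr_out):
    """Wideband amplitude envelope via the Hilbert transform, downsampled
    to the video frame rate so it can be compared with the mouth signal."""
    env = np.abs(hilbert(audio))
    n_out = int(len(audio) * sr_out / sr_audio)
    return resample(env, n_out)

def bandpass_2_7hz(x, sr):
    """Isolate the 2-7 Hz modulation band highlighted in the abstract."""
    b, a = butter(4, [2.0, 7.0], btype="band", fs=sr)
    return filtfilt(b, a, x)

def peak_lag_ms(mouth_area, envelope, sr):
    """Return (lag_ms, peak_r): the lag at which the normalized
    cross-correlation of mouth area and envelope peaks.
    A positive lag means the mouth leads the voice."""
    m = (mouth_area - mouth_area.mean()) / mouth_area.std()
    e = (envelope - envelope.mean()) / envelope.std()
    xc = correlate(e, m, mode="full") / len(m)
    lags = np.arange(-len(m) + 1, len(e))
    k = int(np.argmax(xc))
    return lags[k] / sr * 1000.0, xc[k]

# Synthetic demonstration: 2-7 Hz band-limited "mouth" activity whose
# acoustic consequence lags it by ~200 ms, within the regime reported above.
rng = np.random.default_rng(0)
sr_video = 30.0                                  # assumed video frame rate (Hz)
mouth = bandpass_2_7hz(rng.standard_normal(600), sr_video)   # ~20 s of activity
voice_env = np.roll(mouth, int(0.2 * sr_video))  # envelope delayed by ~200 ms
lag_ms, r = peak_lag_ms(mouth, voice_env, sr_video)
print(f"peak correlation r = {r:.2f} at lag = {lag_ms:.0f} ms")
```

Run on the synthetic data, the script recovers a peak correlation near 1 at a lag of about +200 ms, i.e., the mouth signal leading the voice envelope, which is the sign convention used in the comments above.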