The Natural Statistics of Audiovisual Speech
Open Access
- 17 July 2009
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 5 (7), e1000436
- https://doi.org/10.1371/journal.pcbi.1000436
Abstract
Humans, like other animals, are exposed to a continuous stream of signals, which are dynamic, multimodal, extended, and time varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2–7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
Author Summary
When we watch someone speak, how much work is our brain actually doing? How much of this work is facilitated by the structure of speech itself? Our work shows that not only are the visual and auditory components of speech tightly locked (obviating the need for the brain to actively bind such information), but this temporal coordination also has a distinct rhythm of between 2 and 7 Hz. Furthermore, during speech production, the onset of the voice occurs with a delay of between 100 and 300 ms relative to the initial, visible movements of the mouth. These temporal parameters of audiovisual speech are intriguing because they match known properties of neuronal oscillations in the auditory cortex. Thus, given what we already know about the neural processing of speech, the natural features of audiovisual speech signals seem to be optimally structured for their interactions with ongoing brain rhythms in receivers.
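The three quantities summarized above (a mouth-area/envelope correlation, a 2–7 Hz modulation band, and a mouth-to-voice lag of 100–300 ms) lend themselves to a compact signal-processing sketch. The Python snippet below is illustrative only: the Hilbert-transform envelope, the 4th-order Butterworth band-pass, the 30 Hz video frame rate, and the synthetic test signals are assumptions made for the example, not the authors' published pipeline.

```python
# Minimal sketch of the kind of analysis described in the abstract, assuming
# (a) an audio waveform and (b) a per-frame mouth-opening area from video.
# All signal names, rates, and filter settings here are illustrative assumptions.
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, resample, correlate

def acoustic_envelope(audio, sr_audio, sr_out):
    """Wideband amplitude envelope via the Hilbert transform, downsampled
    to the video frame rate so it can be compared with the mouth signal."""
    env = np.abs(hilbert(audio))
    n_out = int(len(audio) * sr_out / sr_audio)
    return resample(env, n_out)

def bandpass_2_7hz(x, sr):
    """Isolate the 2-7 Hz modulation band highlighted in the abstract."""
    b, a = butter(4, [2.0, 7.0], btype="band", fs=sr)
    return filtfilt(b, a, x)

def peak_lag_ms(mouth_area, envelope, sr):
    """Return (lag_ms, peak_r): the lag at which the normalized
    cross-correlation of mouth area and envelope peaks.
    A positive lag means the mouth leads the voice."""
    m = (mouth_area - mouth_area.mean()) / mouth_area.std()
    e = (envelope - envelope.mean()) / envelope.std()
    xc = correlate(e, m, mode="full") / len(m)
    lags = np.arange(-len(m) + 1, len(e))
    k = int(np.argmax(xc))
    return lags[k] / sr * 1000.0, xc[k]

# Synthetic demonstration: 2-7 Hz band-limited "mouth" activity whose
# acoustic consequence lags it by ~200 ms, within the regime reported above.
rng = np.random.default_rng(0)
sr_video = 30.0                                  # assumed video frame rate (Hz)
mouth = bandpass_2_7hz(rng.standard_normal(600), sr_video)   # ~20 s of activity
voice_env = np.roll(mouth, int(0.2 * sr_video))  # envelope delayed by ~200 ms
lag_ms, r = peak_lag_ms(mouth, voice_env, sr_video)
print(f"peak correlation r = {r:.2f} at lag = {lag_ms:.0f} ms")
```

Run on the synthetic data, the script recovers a peak correlation near 1 at a lag of about +200 ms, i.e., the mouth signal leading the voice envelope, which is the sign convention used in the comments above.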