Abstract
Multiple sound signals, such as speech and interfering noises, can be fairly well separated, localized, and interpreted by human listeners with normal binaural hearing. The computational model presented here, based on earlier cochlear modeling work, is a first step toward approaching human levels of performance on the localization and separation tasks. This combination of cochlear and binaural models, implemented as real-time algorithms, could provide the front end for a robust sound interpretation system such as a speech recognizer. The cochlear model used is essentially a bandpass filterbank with frequency channels corresponding to places on the basilar membrane; filter outputs are half-wave rectified and amplitude-compressed, maintaining fine time resolution. In the binaural model, outputs of corresponding frequency channels from the two ears are combined by cross-correlation. Peaks in the short-time cross-correlation functions are then interpreted as indications of source direction. With appropriate preprocessing, the correlation peaks integrate cues based on signal phase, envelope modulation, onset time, and loudness. Based on peaks in the correlation functions, sources can be recognized, localized, and tracked. Through rapidly varying gains, sound fragments are separated into streams representing different sources. Preliminary tests of the algorithms are very encouraging.
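The abstract describes a concrete pipeline: a cochlear filterbank whose channel outputs are half-wave rectified and amplitude-compressed, followed by per-channel interaural cross-correlation whose peak lags encode source direction. The Python sketch below is only a rough illustration of that idea, not the paper's implementation: the Butterworth bandpass filters, the channel spacing, the compression exponent, and all parameter values here are assumptions (the paper's cochlear model is based on earlier cochlear modeling work and differs in detail).

```python
import numpy as np
from scipy.signal import butter, lfilter

FS = 16000  # sample rate in Hz (assumed)

def cochlear_channels(x, center_freqs, fs=FS, q=4.0, compress=0.3):
    # Crude stand-in for the cochlear model: a bank of bandpass filters
    # (one per basilar-membrane place), half-wave rectification, and
    # power-law amplitude compression. Butterworth filters and these
    # parameter values are assumptions, not the paper's.
    channels = []
    for fc in center_freqs:
        bw = fc / q
        lo = max(fc - bw / 2, 1.0)
        hi = min(fc + bw / 2, fs / 2 - 1.0)
        b, a = butter(2, [lo, hi], btype="band", fs=fs)
        y = lfilter(b, a, x)
        y = np.maximum(y, 0.0)   # half-wave rectify; fine time structure survives
        y = y ** compress        # amplitude compression
        channels.append(y)
    return np.array(channels)

def itd_per_channel(left, right, fs=FS, max_lag_ms=1.0):
    # Cross-correlate corresponding left/right channels and read the lag
    # of the correlation peak as an interaural time difference.
    max_lag = int(fs * max_lag_ms / 1000)
    lags = np.arange(-max_lag, max_lag + 1)
    itds = []
    for l_ch, r_ch in zip(left, right):
        corr = [np.dot(np.roll(l_ch, k)[max_lag:-max_lag],
                       r_ch[max_lag:-max_lag]) for k in lags]
        itds.append(lags[int(np.argmax(corr))] / fs)
    return np.array(itds)

# Toy usage: the same noise burst reaches the right ear 10 samples late,
# so the per-channel ITD estimates should cluster near +10/FS seconds.
rng = np.random.default_rng(0)
src = rng.standard_normal(FS)
left_in = src
right_in = np.concatenate([np.zeros(10), src[:-10]])
cfs = np.geomspace(200, 4000, 16)  # log-spaced centers, mimicking cochlear places
itds = itd_per_channel(cochlear_channels(left_in, cfs),
                       cochlear_channels(right_in, cfs))
print(itds * 1000)  # ITD estimates in milliseconds, one per frequency channel
```

Because rectification and compression preserve both fine time structure and envelope, the correlation peaks in a sketch like this can integrate phase and envelope cues, as the abstract notes; the recognition, tracking, and gain-based stream separation stages it mentions are not modeled here.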
