Audio-visual sensor fusion system for intelligent sound sensing

Abstract
An intelligent sensing system is proposed that autonomously extracts a target sound signal from multi-microphone signals corrupted by interfering ambient noise. Although many types of intelligent signal receivers with multiple sensors have been proposed recently, the use of audio-visual sensor fusion techniques is a special feature of the system described here. The sensor fusion system can be divided into two subsystems: an audio subsystem and a visual subsystem. The audio subsystem extracts the target signal with a digital filter composed of tapped delay lines and adjustable weights. These weights are updated by a special adaptive algorithm called the "cue signal method". For adaptation, the cue signal method needs only a narrowband signal that correlates with the power level of the target signal; this narrowband signal is called the "cue signal". The role of the visual subsystem is therefore to generate the cue signal. The authors have previously proposed methods for generating a cue signal from video images, in which the fusion of audio and visual information was accomplished by simple means. In this paper, two new sensor fusion techniques are proposed: one generates a cue signal using not only video images but also microphone signals, and the other generates a cue signal using microphone signals, video images and internal knowledge. Both constitute a hierarchical sensor fusion of audio and visual information. To evaluate and demonstrate the sensor fusion algorithm, a real-time processing system comprising seventy DSPs was constructed; its architecture is also described.
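The filter structure described above (a tapped delay line with adjustable weights, adapted against a reference that tracks the target's power level) can be illustrated with a minimal sketch. This is not the authors' cue signal method, whose update rule is not given in the abstract; it is a standard LMS-style adaptation in which a hypothetical cue-derived reference signal plays the role of the desired output, shown only to make the filter structure concrete.

```python
import numpy as np

def adaptive_filter_sketch(mic, ref, n_taps=8, mu=0.05):
    """Tapped-delay-line filter with LMS-style weight updates.

    Illustrative only: `mic` is one microphone channel, `ref` stands in
    for a reference derived from the narrowband cue signal. The paper's
    actual "cue signal method" update rule is not reproduced here.
    """
    w = np.zeros(n_taps)            # adjustable weights
    buf = np.zeros(n_taps)          # tapped delay line
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)       # shift samples down the delay line
        buf[0] = mic[n]
        y = w @ buf                 # filter output
        e = ref[n] - y              # error against the cue-derived reference
        w += mu * e * buf           # LMS weight update
        out[n] = y
    return out, w
```

With a stationary input, the weights settle so that the output power tracks the reference, which is the role the cue signal plays in the system described above.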
