Improving connected letter recognition by lipreading

Abstract
The authors show how recognition performance in automated speech perception can be significantly improved by additional lipreading, so-called speech-reading. They demonstrate this on an extension of a state-of-the-art speech recognition system, a modular multi-state time delay neural network (MS-TDNN) architecture. The acoustic and visual speech data are preclassified in two separate front-end phoneme TDNNs and combined into acoustic-visual hypotheses for the dynamic time warping algorithm. This is shown on a connected word recognition problem, the notoriously difficult letter spelling task. With speech-reading, the error rate could be reduced to as little as half that of pure acoustic recognition.
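The combination step described in the abstract can be pictured as follows. This is a minimal sketch, not the authors' implementation: the weighting parameter lam, the toy score arrays, and the helper names are illustrative assumptions. Per-frame phoneme scores from the acoustic and visual TDNNs are merged into a single score stream, which a simple dynamic time warping stage then aligns against a word (letter) model given as a phoneme sequence.

```python
import numpy as np


def combine_streams(acoustic_scores, visual_scores, lam=0.7):
    """Weighted linear combination of per-frame phoneme scores.

    acoustic_scores, visual_scores: (T, P) arrays of phoneme activations.
    lam: weight of the acoustic stream (assumed value, not from the paper).
    """
    return lam * acoustic_scores + (1.0 - lam) * visual_scores


def dtw_word_score(frame_scores, phoneme_sequence):
    """Align one word model (a phoneme index sequence) against the combined
    frame scores with a simple dynamic time warping recursion; lower is better."""
    T = frame_scores.shape[0]
    N = len(phoneme_sequence)
    cost = np.full((T + 1, N + 1), np.inf)
    cost[0, 0] = 0.0
    for t in range(1, T + 1):
        for n in range(1, N + 1):
            # Higher phoneme activation -> lower local alignment cost.
            local = -frame_scores[t - 1, phoneme_sequence[n - 1]]
            cost[t, n] = local + min(cost[t - 1, n],      # stay in same phoneme
                                     cost[t - 1, n - 1])  # advance to next phoneme
    return cost[T, N]


# Toy example: 10 frames, 5 phoneme classes, a two-phoneme model for one letter.
rng = np.random.default_rng(0)
acoustic = rng.random((10, 5))
visual = rng.random((10, 5))
combined = combine_streams(acoustic, visual)
print(dtw_word_score(combined, phoneme_sequence=[1, 3]))
```

In use, such a score would be computed for every letter model and the best-scoring sequence selected; the relative weighting of the two streams is the main knob that determines how much the visual channel can compensate for acoustic noise.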
