Interacting with computers by voice: automatic speech recognition and synthesis
- 8 September 2003
- journal article (review)
- Published by Institute of Electrical and Electronics Engineers (IEEE) in Proceedings of the IEEE
- Vol. 91 (9), 1272-1305
- https://doi.org/10.1109/jproc.2003.817117
Abstract
This paper examines how people communicate with computers using speech. Automatic speech recognition (ASR) transforms speech into text, while automatic speech synthesis, or text-to-speech (TTS), performs the reverse task. ASR has largely been developed on the basis of speech coding theory, while simulating certain spectral analyses performed by the ear. Typically a Fourier transform is employed, with the spectrum warped to the auditory Bark scale and then simplified, via a decorrelating transform, into cepstral coefficients. Current ASR provides good accuracy and performance on limited practical tasks, but exploits only the most rudimentary knowledge about human production and perception phenomena. The popular mathematical model called the hidden Markov model (HMM) is examined; first-order HMMs are computationally efficient but ignore long-range correlations in actual speech. Common language models restrict their syntactic-semantic analysis to a time window of three successive words.

Speech synthesis is the automatic generation of a speech waveform, typically from an input text. As with ASR, TTS starts from a database of information previously established by analysis of much training data, both speech and text. Previously analyzed speech is stored in the database as small units, which are concatenated in the proper sequence at runtime. TTS systems first perform text processing, including "letter-to-sound" conversion, to generate the phonetic transcription. Intonation must then be properly specified to approximate the naturalness of human speech. Modern synthesizers that draw on large databases of stored spectral patterns or waveforms produce highly intelligible synthetic speech, but naturalness remains to be improved.
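The spectral front end described in the abstract is, in modern terms, a filter-bank cepstral analysis: a short-time Fourier transform, triangular filters spaced on the auditory Bark scale, and a cosine transform that decorrelates the log band energies into cepstral coefficients. Below is a minimal Python sketch of that pipeline; all parameter values (frame length, number of bands, and so on) are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def hz_to_bark(f):
    """Traunmueller's approximation of the Bark scale."""
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_filterbank(n_bands, n_fft, sr):
    """Triangular filters spaced evenly on the Bark scale."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    barks = hz_to_bark(freqs)
    edges = np.linspace(barks[0], barks[-1], n_bands + 2)
    fb = np.zeros((n_bands, freqs.size))
    for i in range(n_bands):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (barks - lo) / (mid - lo)
        falling = (hi - barks) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def cepstral_features(signal, sr=16000, frame=400, hop=160,
                      n_fft=512, n_bands=24, n_ceps=13):
    """Short-time |FFT|^2 -> Bark filter bank -> log -> DCT (cepstrum)."""
    window = np.hamming(frame)
    fb = bark_filterbank(n_bands, n_fft, sr)
    # DCT-II basis: decorrelates the log band energies
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_bands)[None, :]
    dct = np.cos(np.pi * k * (n + 0.5) / n_bands)
    feats = []
    for start in range(0, len(signal) - frame + 1, hop):
        spec = np.abs(np.fft.rfft(signal[start:start + frame] * window,
                                  n_fft)) ** 2
        feats.append(dct @ np.log(fb @ spec + 1e-10))
    return np.array(feats)

# One second of noise stands in for real speech here
feats = cepstral_features(np.random.randn(16000))
print(feats.shape)  # (number of frames, 13)
```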
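The first-order limitation the abstract notes is easy to see in a Viterbi decoder, the standard way to find the most likely HMM state sequence: each time step consults only the scores from the immediately preceding frame, so any correlation spanning more than one step is invisible to the model. The two-state, three-symbol model below is an invented toy example, not one from the paper.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence for observation indices `obs`.
    pi: initial probs (S,), A: transitions (S,S), B: emissions (S,V)."""
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))            # best log-prob ending in each state
    psi = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        # First-order assumption: only delta[t-1] matters, never earlier.
        scores = delta[t - 1][:, None] + np.log(A)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], pi, A, B))  # -> [0, 0, 1, 1]
```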
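The three-word window of common language models corresponds to a trigram model: the probability of each word is conditioned on the two words before it, and nothing earlier. A toy sketch with add-alpha smoothing follows; the corpus and smoothing constant are invented for illustration, and real systems train on far larger text collections.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def p_trigram(w3, w1, w2, alpha=1.0):
    """Add-alpha smoothed P(w3 | w1, w2)."""
    return ((trigrams[(w1, w2, w3)] + alpha) /
            (bigrams[(w1, w2)] + alpha * len(vocab)))

print(p_trigram("sat", "the", "cat"))  # seen continuation: 0.25
print(p_trigram("ate", "the", "mat"))  # unseen continuation: ~0.14
```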
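The runtime half of concatenative TTS can be sketched in the same spirit: a letter-to-sound step maps the input text to a phoneme string, then stored units are fetched from a database and joined in sequence. Everything here (the toy lexicon, the sine-wave stand-ins for stored speech units, the crude pitch rule standing in for intonation control) is a hypothetical placeholder rather than any system the paper surveys.

```python
import numpy as np

SR = 16000
# Toy letter-to-sound table; real systems combine dictionaries and rules
LEXICON = {"hi": ["HH", "AY"], "tech": ["T", "EH", "K"]}

def fake_unit(phone, dur=0.12):
    """Stand-in for a stored unit: a tone whose pitch depends on the
    phone label (real systems store analyzed speech waveforms)."""
    t = np.arange(int(SR * dur)) / SR
    f0 = 120 + 10 * (sum(map(ord, phone)) % 8)  # crude intonation proxy
    return 0.3 * np.sin(2 * np.pi * f0 * t)

def synthesize(text):
    """Letter-to-sound lookup, then runtime concatenation of units."""
    phones = [p for word in text.lower().split()
              for p in LEXICON.get(word, [])]
    return np.concatenate([fake_unit(p) for p in phones])

waveform = synthesize("hi tech")
print(waveform.shape)  # one joined waveform, ready to write out as audio
```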