Unit selection in a concatenative speech synthesis system using a large speech database
- 1 May 1996
- proceedings article
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- Vol. 1, 373-376
- https://doi.org/10.1109/icassp.1996.541110
Abstract
One approach to the generation of natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence predicted from text which is annotated with prosodic and phonetic context information. We propose that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database. This approach to waveform synthesis permits training from natural speech: two methods for training from speech are presented which provide weights which produce more natural speech than can be obtained by hand-tuning.
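The search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one list of candidate database units per target phoneme and treats the two costs as caller-supplied functions. The names `select_units`, `target_cost`, `join_cost`, and `beam_width` are hypothetical, and the pruning shown (keeping the cheapest `beam_width` partial paths at each step) is one simple choice; the abstract does not specify the pruning rule.

```python
from typing import Callable, List, Sequence, Tuple

Unit = dict  # hypothetical representation of a database unit or a target


def select_units(
    targets: Sequence[Unit],
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[Unit, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
    beam_width: int = 10,
) -> Tuple[float, List[int]]:
    """Return (total cost, chosen candidate index per target).

    target_cost(target, unit) plays the role of the state occupancy cost;
    join_cost(previous_unit, unit) plays the role of the transition cost.
    """
    if not targets or len(targets) != len(candidates):
        raise ValueError("need one non-empty candidate list per target")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every target needs at least one candidate unit")

    # Each layer holds (accumulated cost, candidate index, backpointer),
    # where the backpointer indexes into the previous (pruned) layer.
    first = sorted(
        (target_cost(targets[0], u), i, -1) for i, u in enumerate(candidates[0])
    )
    layers = [first[:beam_width]]

    for t in range(1, len(targets)):
        prev = layers[-1]
        current = []
        for i, unit in enumerate(candidates[t]):
            # Cheapest way to reach this unit from any surviving predecessor.
            best_cost, best_back = min(
                (p_cost + join_cost(candidates[t - 1][p_idx], unit), k)
                for k, (p_cost, p_idx, _) in enumerate(prev)
            )
            current.append((best_cost + target_cost(targets[t], unit), i, best_back))
        current.sort()
        layers.append(current[:beam_width])  # prune to the cheapest paths

    # Trace the backpointers from the cheapest final state.
    cost, idx, back = layers[-1][0]
    path = [idx]
    for t in range(len(layers) - 1, 0, -1):
        _, idx, back = layers[t - 1][back]
        path.append(idx)
    path.reverse()
    return cost, path
```

In a system of the kind the abstract describes, each cost would presumably be a weighted combination of feature distances, with the weights being what the training methods estimate. Here the weighting is left entirely to the supplied cost functions.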
This publication has 4 references indexed in Scilit:
- Optimising selection of units from speech databases for concatenative synthesis. Published by International Speech Communication Association, 1995
- Concatenative speech synthesis by minimum distortion criteria. Published by Institute of Electrical and Electronics Engineers (IEEE), 1992
- Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 1990
- A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989