Unit selection in a concatenative speech synthesis system using a large speech database

1 May 1996

proceedings article
Published by Institute of Electrical and Electronics Engineers (IEEE)

Vol. 1, 373-376
https://doi.org/10.1109/icassp.1996.541110

Abstract

One approach to the generation of natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence predicted from text which is annotated with prosodic and phonetic context information. We propose that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database. This approach to waveform synthesis permits training from natural speech: two methods for training from speech are presented which provide weights which produce more natural speech than can be obtained by hand-tuning.
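The search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_units`, the `beam_width` pruning rule, and the toy pitch-based cost functions are all assumptions made for illustration. The sketch treats each target position as a column of candidate database units, charges a state occupancy (target) cost per unit and a transition (concatenation) cost per adjacent pair, and runs a Viterbi search with optional beam pruning.

```python
def select_units(targets, candidates, target_cost, join_cost, beam_width=None):
    """Viterbi search over candidate units (illustrative sketch only).

    targets:     list of target specifications, one per position
    candidates:  candidates[i] is a non-empty list of database units for targets[i]
    target_cost: f(target, unit) -> state occupancy cost
    join_cost:   f(previous_unit, unit) -> transition (concatenation) cost
    beam_width:  if set, keep only that many lowest-cost entries per column
                 (a simple pruning rule assumed here, not taken from the paper)
    """
    if not targets:
        return 0.0, []
    columns = []  # each entry: (accumulated_cost, candidate_index, backpointer)
    for i, target in enumerate(targets):
        column = []
        for j, unit in enumerate(candidates[i]):
            occupancy = target_cost(target, unit)
            if i == 0:
                column.append((occupancy, j, None))
                continue
            prev = columns[-1]
            # Cheapest way to reach this unit from any surviving predecessor.
            best, back = min(
                (p_cost + join_cost(candidates[i - 1][p_j], unit), k)
                for k, (p_cost, p_j, _) in enumerate(prev)
            )
            column.append((best + occupancy, j, back))
        if beam_width is not None:
            column = sorted(column, key=lambda entry: entry[0])[:beam_width]
        columns.append(column)

    # Trace back from the cheapest entry in the final column.
    k = min(range(len(columns[-1])), key=lambda idx: columns[-1][idx][0])
    total = columns[-1][k][0]
    path = []
    for i in range(len(columns) - 1, -1, -1):
        _, j, back = columns[i][k]
        path.append(candidates[i][j])
        k = back
    path.reverse()
    return total, path


# Toy data (hypothetical): units are (phoneme, pitch_hz) pairs.
targets = [("h", 100), ("i", 120)]
candidates = [[("h", 95), ("h", 130)], [("i", 125), ("i", 100)]]

total, path = select_units(
    targets,
    candidates,
    target_cost=lambda t, u: abs(t[1] - u[1]),  # pitch mismatch with the target
    join_cost=lambda a, b: abs(a[1] - b[1]),    # pitch jump at the join
)
print(total, path)
```

In this toy case, picking each unit by target cost alone would choose `("i", 125)` for the second phoneme, whereas the joint search prefers `("i", 100)` because it joins more smoothly to `("h", 95)`. In a fuller system each cost would presumably be a weighted sum of several feature distances, with the weights being what the abstract says is trained from natural speech.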
