Cross-lingual experiments with phone recognition

Abstract
This paper presents research on speaker-independent continuous phone recognition for French and English. Phone accuracy is assessed on the BREF corpus for French and on the Wall Street Journal (WSJ) and TIMIT corpora for English. Cross-language differences in language properties are described. French is found to be easier to recognize at the phone level (the phone error is 23.6% for BREF vs. 30.1% for WSJ) but harder to recognize at the lexical level because of its larger number of homophones. Signal-analysis experiments indicate that a 4 kHz signal bandwidth is sufficient for French, whereas 8 kHz is needed for English. Phone recognition is also a powerful technique for language, sex, and speaker identification. With 2 s of speech, the language can be identified with better than 99% accuracy. Sex identification on BREF and WSJ is error-free. Speaker identification accuracies of 98.2% on TIMIT (462 speakers) and 99.1% on BREF (57 speakers) were obtained with one utterance per speaker, and 100% accuracy was obtained with two utterances per speaker.
