Synthesis of F0 contours using generation process model parameters predicted from unlabeled corpora: application to emotional speech synthesis

Article ID	Journal	Published Year	Pages	File Type
9673495	Speech Communication	2005	20 Pages	PDF

Abstract

A corpus-based method of generating fundamental frequency (F0) contours from text was developed for Japanese. Instead of predicting F0 values directly, the method uses binary decision trees to predict the command values of the F0 contour generation process model. Since the model controls F0 movement over words or longer units, sudden undulations, which are unlikely in natural utterances, can be avoided even when a prediction is erroneous. The method includes a scheme for extracting the model commands from given F0 contours, which makes it possible to prepare the corpora for training the binary decision trees automatically. Since the accuracy of the extracted model commands in the training corpora is crucial for the method, constraints are placed on the locations of commands. Although the method can generate any speaking style for which a corpus is available, this paper aims to realize three types of emotional speech (anger, joy, and sadness) in addition to calm speech. The mismatches between the predicted and target contours for angry speech were similar to those for calm speech. Synthesis of emotional speech was then conducted. Phoneme durations were predicted by a similar corpus-based method, and segmental features were generated using an HMM-based speech synthesizer. A perceptual experiment was conducted on the synthesized speech, and the result indicated that anger could be conveyed well by the developed method. The result was less satisfactory for joy and sadness.
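The abstract does not give the equations of the generation process model. As an illustration only, the sketch below uses the commonly cited form of the model (often called the Fujisaki model), in which log F0 is a baseline plus the responses to phrase commands (impulses) and accent commands (steps). The constants `ALPHA`, `BETA`, and `GAMMA` are typical textbook values assumed here, and the function names and command values are hypothetical, not taken from the paper. It shows why a mispredicted command still yields a smooth contour: every command passes through a smoothing response function.

```python
import math

# Assumed typical constants; the abstract does not state the values used.
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling of the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, fb, phrase_cmds, accent_cmds):
    """Return F0 in Hz at each time in `times`.

    fb          -- baseline frequency in Hz
    phrase_cmds -- list of (onset_time, magnitude) pairs
    accent_cmds -- list of (onset_time, offset_time, amplitude) triples
    """
    contour = []
    for t in times:
        log_f0 = math.log(fb)
        # Phrase component: sum of scaled impulse responses.
        log_f0 += sum(ap * phrase_response(t - t0) for t0, ap in phrase_cmds)
        # Accent component: each step command is an onset minus an offset response.
        log_f0 += sum(
            aa * (accent_response(t - t1) - accent_response(t - t2))
            for t1, t2, aa in accent_cmds
        )
        contour.append(math.exp(log_f0))
    return contour


# Hypothetical command values, for illustration only.
times = [0.0, 0.5, 1.0]
contour = f0_contour(
    times,
    fb=120.0,
    phrase_cmds=[(0.1, 0.5)],
    accent_cmds=[(0.3, 0.7, 0.4)],
)
```

In a setup like the one described, the decision trees would supply the command timings and magnitudes, and a function such as `f0_contour` would turn them into the contour passed to the synthesizer.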

Keywords

HMM-based speech synthesis