Article ID: 9673495
Journal: Speech Communication
Published Year: 2005
Pages: 20 Pages
File Type: PDF
Abstract
A corpus-based method for generating fundamental frequency (F0) contours from text was developed for Japanese. Instead of predicting F0 values directly, the method uses binary decision trees to predict the command values of the F0 contour generation process model. Because the model controls F0 movement over units of a word or longer, sudden undulations, which are unlikely in natural utterances, are avoided even when a prediction is erroneous. The method includes a scheme for extracting the model commands from given F0 contours, which makes it possible to prepare the training corpora for the binary decision trees automatically. Because the accuracy of the extracted commands in the training corpora is crucial to the method, constraints are applied to the command locations. Although the method can generate any speaking style for which a corpus is available, this paper aims at realizing three types of emotional speech (anger, joy, and sadness) in addition to calm speech. The mismatches between the predicted and target contours for angry speech were similar to those for calm speech. Emotional speech was then synthesized: phoneme durations were predicted by a similar corpus-based method, and segmental features were generated using an HMM-based speech synthesizer. A perceptual experiment on the synthesized speech indicated that anger was conveyed well by the developed method, whereas the results for joy and sadness were less satisfactory.
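To make the idea of "command values" concrete, the following is a minimal sketch of how an F0 contour can be rebuilt from a small set of commands. It assumes that the F0 contour generation process model referred to in the abstract is the command-response model commonly known as the Fujisaki model, in which log F0 is a baseline plus phrase components (impulse responses) and accent components (step responses). The abstract itself does not give the equations; the function names, the constants ALPHA, BETA and GAMMA (commonly cited defaults), and the command values in the usage note are illustrative assumptions, not values from the paper.

```python
import math

# Commonly cited default constants for the command-response model.
# These are assumptions for illustration; the abstract does not state them.
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling applied to the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    if t < 0.0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, clipped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, base_f0, phrase_cmds, accent_cmds):
    """Return F0 in Hz at each time point in `times` (seconds).

    phrase_cmds: list of (onset_time, magnitude) pairs
    accent_cmds: list of (onset_time, offset_time, amplitude) triples
    """
    contour = []
    for t in times:
        # Components are summed in the log-F0 domain.
        log_f0 = math.log(base_f0)
        for onset, magnitude in phrase_cmds:
            log_f0 += magnitude * phrase_response(t - onset)
        for onset, offset, amplitude in accent_cmds:
            log_f0 += amplitude * (
                accent_response(t - onset) - accent_response(t - offset)
            )
        contour.append(math.exp(log_f0))
    return contour
```

For example, `f0_contour([i * 0.01 for i in range(200)], 100.0, [(0.0, 0.5)], [(0.3, 0.7, 0.4)])` yields a two-second contour from one phrase command and one accent command. Because every response function is smooth and spans a word or a longer unit, a wrongly predicted command magnitude shifts the contour gradually rather than producing frame-to-frame jumps, which is the robustness property the abstract describes.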
Related Topics
Physical Sciences and Engineering > Computer Science > Signal Processing