Article ID: 6960878
Journal: Speech Communication
Published Year: 2017
Pages: 21
File Type: PDF
Abstract
Speech-driven head movement methods are motivated by the strong coupling between head movements and speech, which makes them an appealing way to create behaviors that are synchronized in time with speech. This paper offers solutions for two of the problems associated with these methods. First, speech-driven methods require all the potential utterances of the conversational agent (CA) to be recorded, which limits their applications. Using existing text-to-speech (TTS) systems broadens the applicability of these methods by providing the flexibility of using text instead of pre-recorded speech. However, simply training speech-driven models with natural speech and testing them with synthetic speech creates a mismatch that degrades the performance of the system. This paper proposes a novel strategy to solve this mismatch. The proposed approach starts by creating a parallel corpus with either neutral or emotional synthetic speech that is time-aligned with the original speech for which we have the motion capture recordings. This parallel corpus is used to retrain the models from scratch, or to adapt the models originally built with natural speech. Both subjective and objective evaluations show the effectiveness of this solution in reducing the mismatch. Second, creating head movement with speech-driven methods can disregard the meaning of the message, even when the movements are perfectly synchronized with speech. The trajectory of head movements in conversations also has a role in conveying meaning (e.g., head nods for acknowledgment). In fact, our analysis reveals that head movements under different discourse functions have distinguishable patterns. Building on the best models driven by synthetic speech, we propose to extract dialog acts directly from the text and use this information to directly constrain our models. Compared to the unconstrained model, the constrained model generates head motion sequences that are not only closer to the statistical patterns of the original head movements, but are also perceived as more natural and appropriate.
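The abstract does not say how the synthetic speech is time-aligned with the original recordings. As a rough illustration only, the sketch below assumes that word boundaries are available for both the natural and the synthetic utterance, and warps the synthetic feature frames onto the natural timeline by piecewise-linear interpolation between matching boundaries, so that each motion capture frame is paired with one synthetic speech frame. The function names, the boundary lists, and the shared frame rate are all hypothetical and are not taken from the paper.

```python
from bisect import bisect_right


def warp_time(t, src_bounds, dst_bounds):
    """Map time t (seconds) on the source timeline to the destination timeline.

    src_bounds and dst_bounds are matching, increasing lists of word-boundary
    times (hypothetical inputs, e.g. from a forced aligner and a TTS engine).
    """
    if len(src_bounds) != len(dst_bounds) or len(src_bounds) < 2:
        raise ValueError("need matching boundary lists with at least 2 entries")
    # Clamp t to the span covered by the source boundaries.
    t = min(max(t, src_bounds[0]), src_bounds[-1])
    # Index of the segment [src_bounds[i], src_bounds[i + 1]] containing t.
    i = min(bisect_right(src_bounds, t) - 1, len(src_bounds) - 2)
    src_len = src_bounds[i + 1] - src_bounds[i]
    frac = 0.0 if src_len == 0 else (t - src_bounds[i]) / src_len
    return dst_bounds[i] + frac * (dst_bounds[i + 1] - dst_bounds[i])


def align_synthetic_to_natural(synth_frames, n_natural_frames,
                               natural_bounds, synth_bounds, frame_rate):
    """Return one synthetic frame per natural (motion capture) frame.

    Assumes both feature streams use the same frame_rate (frames per second).
    """
    if not synth_frames:
        raise ValueError("synth_frames must not be empty")
    last = len(synth_frames) - 1
    aligned = []
    for k in range(n_natural_frames):
        t_nat = k / frame_rate                      # time of natural frame k
        t_syn = warp_time(t_nat, natural_bounds, synth_bounds)
        idx = int(t_syn * frame_rate)               # synthetic frame containing t_syn
        aligned.append(synth_frames[min(max(idx, 0), last)])
    return aligned
```

In this sketch, a 2-second natural utterance paired with a 1-second synthetic rendition of the same text would yield a synthetic feature sequence stretched to the natural length, which could then be paired frame by frame with the head motion data when retraining or adapting a model.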
Related Topics
Physical Sciences and Engineering > Computer Science > Signal Processing