Article ID: 558210
Journal: Computer Speech & Language
Published Year: 2016
Pages: 14
File Type: PDF
Abstract

• We present an articulatory-to-acoustic mapping for real-time articulatory synthesis.
• The method uses a deep neural network with a tapped-delay input line.
• The tapped-delay line efficiently captures dynamics in articulatory trajectories.
• The model achieved higher accuracy than competing models based on Gaussian mixtures.
• The improvement was also perceptible in a subjective listening test.

The conventional approach to data-driven articulatory synthesis models the joint acoustic-articulatory distribution with a Gaussian mixture model (GMM) and then applies a post-processing step that optimizes the resulting acoustic trajectories. This final step can significantly improve the accuracy of the GMM's frame-by-frame mapping, but it is computationally intensive and requires that the entire utterance be synthesized beforehand, making it unsuitable for real-time synthesis. To address this issue, we present a deep neural network (DNN) articulatory synthesizer that uses a tapped-delay input line, allowing the model to capture context information in the articulatory trajectory without the need for post-processing. We characterize the DNN as a function of context size and number of hidden layers, and compare it against two GMM articulatory synthesizers: a baseline model that performs a simple frame-by-frame mapping, and a second model that also performs trajectory optimization. Our results show that a DNN with a 60-ms context window and two 512-neuron hidden layers can synthesize speech at four times the frame rate, comparable in speed to the frame-by-frame mappings, while improving on the accuracy of trajectory optimization (a 9.8% reduction in Mel-cepstral distortion). Subjective evaluation through pairwise listening tests also shows a strong preference for the DNN articulatory synthesizer over GMM trajectory optimization.
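As a rough illustration of the architecture described in the abstract (not the authors' implementation), the sketch below shows how a tapped-delay input line can feed a feed-forward DNN with two 512-neuron hidden layers. It assumes PyTorch, hypothetical feature dimensions (12 articulatory channels, 25 Mel-cepstral coefficients), and a symmetric ±3-frame context at a 10-ms frame shift to approximate the 60-ms window; TappedDelayDNN and stack_context are names introduced here for illustration only.

```python
# Minimal sketch, not the paper's code: a frame-by-frame articulatory-to-acoustic
# DNN whose input is a tapped-delay line of articulatory frames.
import torch
import torch.nn as nn

ART_DIM = 12   # articulatory features per frame (hypothetical)
MCEP_DIM = 25  # Mel-cepstral coefficients per frame (hypothetical)
CONTEXT = 3    # +/- 3 frames at an assumed 10-ms shift, roughly a 60-ms window
TAPS = 2 * CONTEXT + 1

class TappedDelayDNN(nn.Module):
    """Maps a stack of articulatory frames to one acoustic frame."""
    def __init__(self, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(TAPS * ART_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, MCEP_DIM),
        )

    def forward(self, stacked_frames):
        # stacked_frames: (batch, TAPS * ART_DIM) = current frame plus neighbours
        return self.net(stacked_frames)

def stack_context(traj, context=CONTEXT):
    """Build tapped-delay inputs from an articulatory trajectory of shape (T, ART_DIM)."""
    # Pad the trajectory by repeating the first and last frames at the edges.
    padded = torch.cat([traj[:1].repeat(context, 1), traj, traj[-1:].repeat(context, 1)])
    return torch.stack([padded[t:t + 2 * context + 1].reshape(-1)
                        for t in range(traj.shape[0])])

# Usage: map a 100-frame articulatory trajectory to acoustic features, one frame at a time.
model = TappedDelayDNN()
trajectory = torch.randn(100, ART_DIM)
acoustic = model(stack_context(trajectory))  # (100, MCEP_DIM)
```

Because each output frame depends only on a short window of past and future articulatory frames, the mapping can run incrementally with a small look-ahead, which is what makes the approach compatible with real-time synthesis, unlike utterance-level trajectory optimization.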

Related Topics
Physical Sciences and Engineering › Computer Science › Signal Processing