Comparing ANN and GMM in a voice conversion framework

Article ID	Journal	Published Year	Pages	File Type
496412	Applied Soft Computing	2012	11 Pages	PDF

Abstract

In this paper, we present a comparative analysis of artificial neural networks (ANNs) and Gaussian mixture models (GMMs) for the design of a voice conversion system using line spectral frequencies (LSFs) as feature vectors. Both the ANN- and GMM-based models are explored to capture nonlinear mapping functions for modifying the vocal tract characteristics of a source speaker according to a desired target speaker. The LSFs are used to represent the vocal tract transfer function of a particular speaker. Mapping of the intonation patterns (pitch contour) is carried out using a codebook-based model at the segmental level. The energy profile of the signal is modified using a fixed scaling factor defined between the source and target speakers at the segmental level. Two methods of residual modification, residual copying and residual selection, are used to generate the target residual signal. The performance of the ANN- and GMM-based voice conversion (VC) systems is evaluated using subjective and objective measures. The results indicate that the proposed ANN-based model using the LSF feature set may be used as an alternative to state-of-the-art GMM-based models for designing a voice conversion system.
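The abstract does not say how the fixed energy scaling factor is defined, so the following is only a minimal sketch under an assumption: the factor is taken as the square root of the ratio between the average segment energy of the target speaker and that of the source speaker, and is then applied as a constant gain to each source segment. The function names (`segment_energy`, `fixed_energy_scale`, `apply_scale`) are hypothetical and do not come from the paper.

```python
import math


def segment_energy(samples):
    """Mean squared amplitude of one speech segment (a list of floats)."""
    return sum(s * s for s in samples) / len(samples)


def fixed_energy_scale(source_segments, target_segments):
    """Single amplitude gain relating source and target energy levels.

    Assumption (not stated in the abstract): the gain is the square root
    of the ratio of the average target segment energy to the average
    source segment energy, estimated once from training segments.
    """
    src = sum(segment_energy(s) for s in source_segments) / len(source_segments)
    tgt = sum(segment_energy(s) for s in target_segments) / len(target_segments)
    return math.sqrt(tgt / src)


def apply_scale(segment, gain):
    """Scale every sample of a segment by the fixed gain."""
    return [gain * s for s in segment]


if __name__ == "__main__":
    # Toy training segments: the target is twice as loud in amplitude.
    source_train = [[1.0, -1.0], [1.0, 1.0]]
    target_train = [[2.0, -2.0], [2.0, 2.0]]
    gain = fixed_energy_scale(source_train, target_train)
    print(gain, apply_scale([0.5, -0.5], gain))
```

Because the gain is estimated once and reused for every segment, this kind of scheme matches overall loudness between the two speakers but cannot follow segment-by-segment variations in the target energy profile.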

Graphical abstract

The desired spectral envelope along with the spectral envelopes predicted by the baseline Gaussian mixture model (GMM) and the proposed 5-layer feed-forward neural network for mapping the vocal tract characteristics of a source speaker according to a desired target speaker for voice conversion.

Highlights

► A database of four Indian speakers has been developed to design a voice conversion system.
► A 5-layer ANN-based model has been proposed for vocal tract modification in a voice conversion framework.
► Spectral mapping using the ANN performs as well as the conventional GMM-based model.
► An epoch-based codebook model for pitch contour modification can capture the patterns of the desired target pitch contour.
► LP residual selection and LP residual copying methods are compared in terms of speaker identity and quality of conversion.

Keywords

Artificial neural networks; Gaussian mixture models; Energy profiles; Prosody; Pitch contour