کد مقاله | کد نشریه | سال انتشار | مقاله انگلیسی | نسخه تمام متن |
---|---|---|---|---|
557986 | 1451694 | 2006 | 20 صفحه PDF | دانلود رایگان |
Automatic recognition of children’s speech using acoustic models trained by adults results in poor performance due to differences in speech acoustics. These acoustical differences are a consequence of children having shorter vocal tracts and smaller vocal cords than adults. Hence, speaker adaptation needs to be performed. However, in real-world applications, the amount of adaptation data available may be less than what is needed by common speaker adaptation techniques to yield reasonable performance. In this paper, we first study, in the discrete frequency domain, the relationship between frequency warping in the front-end and corresponding transformations in the back-end. Three common feature extraction schemes are investigated and their transformation linearity in the back-end are discussed. In particular, we show that under certain approximations, frequency warping of MFCC features with Mel-warped triangular filter banks equals a linear transformation in the cepstral space. Based on that linear transformation, a formant-like peak alignment algorithm is proposed to adapt adult acoustic models to children’s speech. The peaks are estimated by Gaussian mixtures using the Expectation-Maximization (EM) algorithm [Zolfaghari, P., Robinson, T., 1996. Formant analysis using mixtures of Gaussians, Proceedings of International Conference on Spoken Language Processing, 1229–1232]. For limited adaptation data, the algorithm outperforms traditional vocal tract length normalization (VTLN) and maximum likelihood linear regression (MLLR) techniques.
Journal: Computer Speech & Language - Volume 20, Issue 4, October 2006, Pages 400–419