Deep neural network acoustic models for spoken assessment applications

Article ID	Journal	Published Year	Pages	File Type
565885	Speech Communication	2015	14 Pages	PDF

Abstract

• DNN-HMMs outperform GMM-HMMs by a large margin for all spoken assessment tasks.
• Open-ended tasks benefit far more than constrained tasks from the use of DNN-HMMs.
• For open-ended tasks, DNN-HMMs can take full advantage of increasing training data.
• The performance of constrained tasks saturates at around 25 h of training data.
• Constrained tasks require only a few hours of data to build well-performing models.

In this paper, we investigate the effectiveness of deep neural network hidden Markov models (DNN-HMMs) for acoustic modeling in educational applications. Specifically, we focus on spoken responses from non-native speakers and children, whose speech tends to show great acoustic variability. We perform comprehensive experiments comparing traditional Gaussian mixture model (GMM)-HMMs with DNN-HMMs on three large language assessment datasets that contain various spoken tasks, classified broadly as constrained and open-ended. Our experimental results suggest conclusions that can help guide the design of real-life educational applications. DNN-HMMs outperform conventional GMM-HMMs by a large margin for all spoken tasks commonly used in spoken assessment applications. In our experiments, DNN-HMMs trained on 25 h of data can outperform GMM-HMMs trained on 6.7–9 times as much data. Regarding overall performance, when all available training data were used (175, 227, and 169 h, respectively), switching from GMMs to DNNs yielded a relative word error rate decrease of 20.4% for adult English and 29.3% for child English, and a relative character error rate decrease of 14.3% for adult Chinese. Comparing task types, we find that the more challenging open-ended tasks benefit significantly more from DNN-HMMs than constrained tasks do. For open-ended tasks, large amounts of training data are key, as DNN-HMMs can take full advantage of the added data to push performance further. In contrast, the performance of constrained spoken tasks saturates at around 25 h of training data. At the same time, constrained spoken tasks require only a few hours of data (1 or 5 h) to build well-performing acoustic models. This is an encouraging observation: it indicates the potential to build reliable spoken assessment applications based on constrained tasks when little domain-specific training data is available.
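
The abstract reports relative, not absolute, error-rate decreases and a data ratio of "6.7–9 times". The sketch below shows how such figures are conventionally computed. The function name and the absolute error rates (30.0 and 24.0) are hypothetical and chosen only for illustration; the abstract does not give absolute error rates. The mapping of the 175, 227, and 169 h figures to the three datasets, and the reading of the 6.7–9 range as full-set size divided by 25 h, are assumptions inferred from the wording of the abstract.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Relative error-rate reduction, as a percentage of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical absolute error rates (NOT from the paper): a drop from
# 30.0% to 24.0% is a 6-point absolute but 20% relative reduction.
example = relative_reduction(baseline_error=30.0, new_error=24.0)

# Full training-set sizes in hours as listed in the abstract; the
# assignment to datasets follows the order of the sentence (assumption).
full_hours = {"adult English": 175, "child English": 227, "adult Chinese": 169}

# Ratio of each full set to the 25 h DNN-HMM training set (assumed reading
# of the "6.7-9 times" figure).
ratios = {name: hours / 25 for name, hours in full_hours.items()}

if __name__ == "__main__":
    print(f"relative reduction: {example:.1f}%")
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f}x")
```

Under these assumptions the ratios fall between roughly 6.8 and 9.1, which is consistent with the range quoted in the abstract.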

Keywords

Language learning; Deep neural networks; Acoustic modeling