Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
565885 | Speech Communication | 2015 | 14 Pages |
• DNN-HMMs outperform GMM-HMMs by a large margin for all spoken assessment tasks.
• Open-ended tasks benefit far more than constrained tasks from the use of DNN-HMMs.
• For open-ended tasks, DNN-HMMs can take full advantage of increasing training data.
• The performance of constrained tasks saturates at around 25 h of training data.
• Constrained tasks require only a few hours of data to build well-performing models.
In this paper, we investigate the effectiveness of deep neural network hidden Markov models (DNN-HMMs) for acoustic modeling in educational applications. Specifically, we focus on spoken responses from non-native speakers and children, whose speech tends to show great acoustic variability. We perform comprehensive experiments comparing traditional Gaussian mixture model (GMM)-HMMs with DNN-HMMs on three large language assessment datasets that contain various spoken tasks, classified broadly as constrained and open-ended. Our experimental results suggest conclusions that can help guide the design of real-life educational applications. DNN-HMMs outperform conventional GMM-HMMs by a large margin for all spoken tasks commonly used in spoken assessment applications. In our experiments, DNN-HMMs trained on 25 h of data can outperform GMM-HMMs trained on 6.7–9 times as much data. When all available training data were used (175, 227, and 169 h, respectively), switching from GMMs to DNNs gave a relative word error rate decrease of 20.4% for adult English and 29.3% for child English, and a relative character error rate decrease of 14.3% for adult Chinese. Comparing task types, we find that the more challenging open-ended tasks benefit significantly more from DNN-HMMs than constrained tasks do. For open-ended tasks, large amounts of training data are the key, as DNN-HMMs can take full advantage of the added data and push performance further. In contrast, the performance of constrained spoken tasks saturates at around 25 h of training data. At the same time, constrained spoken tasks require only a few hours of data (1 or 5 h) to build well-performing acoustic models.
This is an encouraging observation: it indicates the potential to build reliable spoken assessment applications based on constrained tasks when little domain-specific training data is available.
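
The abstract reports its gains as relative decreases in word error rate (or character error rate for Chinese). As a minimal sketch of how such figures are conventionally computed, the Python below derives an error rate from a word-level edit distance and then the relative decrease between two systems. The function names and the example sentences are hypothetical illustrations, not the authors' scoring tools or data; a character error rate would apply the same computation to characters instead of words.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_rate, new_rate):
    """Relative decrease of new_rate with respect to baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Hypothetical example: one substitution and one deletion in six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

With these definitions, a baseline error rate of 0.30 and a new error rate of 0.24 (made-up values) correspond to a relative decrease of 0.2, that is, 20%.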