Article ID: 4977795
Journal: Speech Communication
Published Year: 2017
Pages: 10
File Type: PDF
Abstract
In this study, we propose an efficient way to combine human and automated scoring to increase the reliability and validity of a system used to assess spoken responses in the context of an international English language assessment. A set of filtering systems is used to automatically identify classes of spoken responses that are difficult to score with an automated scoring system, for example, because of a high level of noise or imperfections in components of the overall system. These flagged responses are then routed to human raters for scoring. The vast majority of responses are not flagged by the filtering system and are scored by the automated scoring system, resulting in a hybrid scoring approach. The overall hybrid speech scoring system presented here comprises multiple subprocesses: recording of spoken responses, transcription based on an automatic speech recognizer, linguistic feature generation, filtering of problematic responses, automated score generation, human rater scoring, and final score combination. We evaluate this scoring approach with pilot data from a novel international English proficiency assessment. It achieves a substantial improvement in scoring performance and score validity with a limited amount of human scoring and most responses scored automatically: the correlation between the baseline system (baseline filtering with imputation) and human raters' scores is 0.72, and with an extended filtering model the performance improves to 0.82. The improvement can be attributed in part to the extended filtering model itself, which identifies more classes of non-scorable responses, and in part to the combination of machine and human scores in the hybrid system.
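The abstract describes a routing architecture rather than a specific implementation; the following is a minimal sketch of that hybrid routing idea, not the authors' code. The names (SpokenResponse, is_non_scorable, machine_score, human_score) are hypothetical placeholders for the filtering model, the automated scoring model, and the human rater queue described above.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SpokenResponse:
    response_id: str
    audio_path: str
    asr_transcript: str   # output of the automatic speech recognizer
    features: dict        # linguistic features derived from the transcript

def hybrid_score(
    response: SpokenResponse,
    is_non_scorable: Callable[[SpokenResponse], bool],  # filtering model (hypothetical interface)
    machine_score: Callable[[SpokenResponse], float],   # automated scoring model
    human_score: Callable[[SpokenResponse], float],     # human rater scoring
) -> float:
    """Route a flagged response to a human rater; otherwise score it automatically."""
    if is_non_scorable(response):
        # Flagged as problematic, e.g. high noise or an upstream component failure.
        return human_score(response)
    return machine_score(response)
```

In this sketch, only the small fraction of flagged responses incurs human scoring cost, while all other responses are scored automatically, which mirrors the hybrid approach evaluated in the paper.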
Related Topics
Physical Sciences and Engineering > Computer Science > Signal Processing
Authors