Evaluating spoken dialogue systems according to de-facto standards: A case study

Article ID	Journal	Published Year	Pages	File Type
558424	Computer Speech & Language	2007	28 Pages	PDF

Abstract

In the present paper, we investigate the validity and reliability of de-facto evaluation standards, defined for measuring or predicting the quality of the interaction with spoken dialogue systems. Two experiments have been carried out with a dialogue system for controlling domestic devices. During these experiments, subjective judgments of quality have been collected by two questionnaire methods (ITU-T Rec. P.851 and SASSI), and parameters describing the interaction have been logged and annotated. Both metrics served the derivation of prediction models according to the PARADISE approach. Although the limited database allows only tentative conclusions to be drawn, the results suggest that both questionnaire methods provide valid measurements of a large number of different quality aspects; most of the perceptive dimensions underlying the subjective judgments can also be measured with a high reliability. The extracted parameters mainly describe quality aspects which are directly linked to the system, environmental and task characteristics. Used as an input to prediction models, the parameters provide helpful information for system design and optimization, but not general predictions of system usability and acceptability.