Multistage data selection-based unsupervised speaker adaptation for personalized speech emotion recognition

Article ID	Journal	Published Year	Pages	File Type
380198	Engineering Applications of Artificial Intelligence	2016	9 Pages	PDF

Abstract

This paper proposes an efficient speech emotion recognition (SER) approach that utilizes personal voice data accumulated on personal devices. A representative weakness of conventional SER systems is that their performance varies from user to user, a consequence of the speaker-independent (SI) acoustic model framework. Handheld communication devices such as smartphones, however, accumulate an individual's voice data and thus provide suitable conditions for personalized SER that improves on the SI model framework. Taking advantage of such devices, we propose an efficient personalized SER scheme that employs maximum likelihood linear regression (MLLR), a representative speaker adaptation technique. To advance the conventional MLLR technique for SER tasks, the proposed approach selects data that convey emotionally discriminative acoustic characteristics and uses only those data for adaptation. For reliable data selection, we conduct multistage selection using a log-likelihood distance-based measure and a universal background model. In SER experiments on a Linguistic Data Consortium emotional speech corpus, our approach outperformed both conventional adaptation techniques and the SI model framework.
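
The selection-then-adaptation idea can be illustrated with a minimal sketch. The abstract does not give the exact selection rule, so everything below is an assumption made for illustration: each emotion model and the background model is reduced to a single diagonal Gaussian, stage 1 keeps an utterance only if the log-likelihood distance between its two best-scoring emotion models exceeds a hypothetical `margin`, and stage 2 keeps it only if the best emotion model also scores above the background model. The names `select_utterances` and `mllr_transform_means` are hypothetical; only the affine mean update, new mean = A · mean + b, is the standard MLLR form.

```python
import numpy as np


def diag_gauss_loglik(frames, mean, var):
    """Average per-frame log-likelihood under a diagonal Gaussian."""
    diff = frames - mean
    per_frame = -0.5 * (np.log(2 * np.pi * var) + diff ** 2 / var).sum(axis=1)
    return per_frame.mean()


def select_utterances(utterances, emotion_models, background_model, margin=1.0):
    """Two-stage selection (illustrative assumption, not the paper's exact rule).

    utterances: list of (num_frames, dim) arrays.
    emotion_models: dict mapping label -> (mean, var); needs at least two labels.
    background_model: (mean, var) standing in for a universal background model.
    Returns a list of (utterance_index, pseudo_label) pairs.
    """
    selected = []
    for idx, frames in enumerate(utterances):
        scores = {
            label: diag_gauss_loglik(frames, mean, var)
            for label, (mean, var) in emotion_models.items()
        }
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        best_label, best_score = ranked[0]
        runner_up_score = ranked[1][1]
        # Stage 1: reject utterances whose top two emotion scores are too close.
        if best_score - runner_up_score < margin:
            continue
        # Stage 2: reject utterances the background model explains at least as well.
        if best_score <= diag_gauss_loglik(frames, *background_model):
            continue
        selected.append((idx, best_label))
    return selected


def mllr_transform_means(means, A, b):
    """Apply one shared MLLR mean transform: each row mu becomes A @ mu + b."""
    return means @ A.T + b
```

In a full system, the transform parameters A and b would be estimated by maximum likelihood from the selected utterances only; that estimation step is omitted here.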


Keywords

Universal background model; Maximum likelihood linear regression; Speaker adaptation; Speech emotion recognition; Acoustic model