Article ID: 566772 | Journal: Speech Communication | Published: 2015 | Pages: 14 | File type: PDF
Abstract

• ESSEM allowed unsupervised model adaptation with only one testing utterance.
• ATG mapping function enhances ESSEM framework with sufficient adaptation statistics.
• Adaptation performance determined by appropriate selection of ensemble models.
• Over-fitting issues handled with optimization processes: MAP, MS, and CS.
• Imposing constraints on ATG allows flexible extension to LR, BC, LCB, LC, and BF.
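The last highlight says that constraining ATG yields simpler mapping functions. The abstract does not give the ATG equation, so the sketch below assumes one plausible form: each ensemble model k gets its own affine transform A_k (restricted here to a diagonal matrix for brevity), and the transformed means are summed with a shared bias b, giving mu_hat = sum_k A_k mu_k + b. Under that assumption, setting A_k = w_k * I recovers LCB, and additionally fixing b = 0 recovers LC. The function names are hypothetical, and LR, BC, and BF are not defined in this excerpt, so they are not sketched.

```python
from typing import List, Sequence

Vector = List[float]


def atg_mean(ensemble_means: Sequence[Sequence[float]],
             scales: Sequence[Sequence[float]],
             bias: Sequence[float]) -> Vector:
    """Hypothetical ATG mapping with diagonal per-model transforms.

    mu_hat[d] = sum_k scales[k][d] * ensemble_means[k][d] + bias[d]
    """
    out = [float(b) for b in bias]
    for mu_k, a_k in zip(ensemble_means, scales):
        for d in range(len(out)):
            out[d] += a_k[d] * mu_k[d]
    return out


def lcb_mean(ensemble_means, weights, bias) -> Vector:
    """LCB as constrained ATG: A_k = w_k * I (one scalar per model) plus a shared bias."""
    dim = len(bias)
    scales = [[w] * dim for w in weights]
    return atg_mean(ensemble_means, scales, bias)


def lc_mean(ensemble_means, weights) -> Vector:
    """LC as constrained LCB: the bias is fixed to zero."""
    dim = len(ensemble_means[0])
    return lcb_mean(ensemble_means, weights, [0.0] * dim)


if __name__ == "__main__":
    means = [[1.0, 2.0], [3.0, 4.0]]  # two toy ensemble mean vectors
    print(lc_mean(means, [0.25, 0.75]))
    print(lcb_mean(means, [0.25, 0.75], [0.5, -0.5]))
    # Unconstrained (diagonal) ATG: each model scales each dimension independently.
    print(atg_mean(means, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
```

The sketch makes the parameter count explicit: LC has one weight per model, LCB adds one bias vector, and diagonal ATG has one weight per model per dimension, which is why ATG needs more adaptation data or stronger regularization.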

The ensemble speaker and speaking environment modeling (ESSEM) framework was designed to provide online optimization that improves system performance under real-world conditions. In the ESSEM framework, ensemble models are built in the offline phase, each characterizing a specific environment based on local statistics prepared from that particular condition. In the online phase, a mapping function is computed from the incoming test data to perform model adaptation. Previous studies used linear combination (LC) and linear combination with a correction bias (LCB) as simple mapping functions that apply only one weighting coefficient to each model. To better utilize the ensemble models, this study presents a generalized affine transform group (ATG) mapping function for the ESSEM framework. Although ATG characterizes unknown test conditions more precisely by using a larger number of parameters, over-fitting occurs when the available adaptation data are especially limited. This study handles over-fitting with three optimization processes: the maximum a posteriori (MAP) criterion, model selection (MS), and cohort selection (CS). Experimental results showed that ATG, together with the three optimization processes, enabled the ESSEM framework to perform unsupervised model adaptation using only one utterance while providing consistent performance improvements.
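The abstract names MAP, MS, and CS without giving their formulas. As a simplified, hypothetical illustration of how a prior and a reduced ensemble curb over-fitting, the sketch below replaces the HMM likelihood (which in practice would be maximized over the adaptation utterance, typically with EM) with a single target mean vector y and a unit-variance Gaussian likelihood. It first picks a cohort of the n ensemble means nearest to y by Euclidean distance (an assumed criterion; the paper's actual selection rule is not stated here), then estimates LC weights under a Gaussian prior centered at prior weights w0 with precision lam, which gives the closed form w = (M^T M + lam*I)^(-1) (M^T y + lam*w0). All function names, lam, w0, and the toy data are assumptions.

```python
import math
from typing import List, Sequence

Vector = List[float]


def select_cohort(means: Sequence[Sequence[float]], target: Sequence[float], n: int) -> List[int]:
    """Hypothetical cohort selection: indices of the n means nearest to target."""
    dists = [math.dist(m, target) for m in means]
    return sorted(range(len(means)), key=lambda k: dists[k])[:n]


def solve(a: List[List[float]], b: List[float]) -> Vector:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def map_lc_weights(means, target, prior, lam) -> Vector:
    """MAP LC weights: unit-variance Gaussian likelihood on target, prior N(prior, I/lam).

    Solves (M^T M + lam*I) w = M^T y + lam*prior, where column k of M is means[k].
    lam > 0 keeps the system positive definite even with nearly collinear models.
    """
    k = len(means)
    gram = [[sum(p * q for p, q in zip(means[i], means[j])) + (lam if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(p * q for p, q in zip(means[i], target)) + lam * prior[i] for i in range(k)]
    return solve(gram, rhs)


if __name__ == "__main__":
    means = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    target = [0.2, 0.8, 0.0]  # toy statistic standing in for one test utterance
    cohort = select_cohort(means, target, n=2)
    sub = [means[k] for k in cohort]
    prior = [1.0 / len(sub)] * len(sub)  # uniform prior weights
    print(cohort)
    print(map_lc_weights(sub, target, prior, lam=0.0))  # maximum-likelihood fit
    print(map_lc_weights(sub, target, prior, lam=1.0))  # shrunk toward the prior
```

With lam = 0 the weights fit the target exactly; larger lam pulls them toward w0, which is the desired behavior when a single utterance yields only a noisy target. Restricting the fit to a cohort similarly reduces the number of free weights.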
