SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian

Article ID	Journal	Published Year	Pages	File Type
567033	Speech Communication	2014	18 Pages	PDF

Abstract

• The first large-vocabulary Romanian ASR system is presented.
• Phonetization and diacritics restoration systems for Romanian are introduced.
• An innovative ASR domain-adaptation methodology based on SMT is proposed.
• The semi-supervised adaptation methods are shown to improve ASR performance.

This study investigates the possibility of using statistical machine translation (SMT) to create domain-specific language resources. We propose a methodology for building a domain-specific automatic speech recognition (ASR) system for a low-resourced language when in-domain text corpora are available only in a high-resourced language. Several translation scenarios (both unsupervised and semi-supervised) are used to obtain domain-specific textual data. Moreover, this paper shows that a small amount of manually post-edited text is enough to develop other natural language processing systems that, in turn, can be used to automatically improve the machine-translated text, leading to a significant boost in ASR performance. The paper closes with an in-depth analysis of why and how the machine-translated text improves the performance of the domain-specific ASR system. As by-products of this core domain-adaptation methodology, this paper also presents the first large-vocabulary continuous speech recognition system for Romanian and introduces a diacritics restoration module for processing Romanian text corpora, as well as an automatic phonetization module needed to extend the Romanian pronunciation dictionary.

Keywords

Domain adaptation; Statistical machine translation; Automatic speech recognition; Under-resourced languages; Language modeling