Article ID: 567033
Journal: Speech Communication
Published Year: 2014
Pages: 18
File Type: PDF
Abstract

• The first large-vocabulary Romanian ASR system is presented.
• Phonetization and diacritics restoration systems for Romanian are introduced.
• An innovative ASR domain-adaptation methodology based on SMT is proposed.
• The semi-supervised adaptation methods are shown to improve ASR performance.

This study investigates the possibility of using statistical machine translation (SMT) to create domain-specific language resources. We propose a methodology for building a domain-specific automatic speech recognition (ASR) system for a low-resourced language when in-domain text corpora are available only in a high-resourced language. Several translation scenarios (both unsupervised and semi-supervised) are used to obtain domain-specific textual data. Moreover, this paper shows that a small amount of manually post-edited text is enough to develop other natural language processing systems that, in turn, can be used to automatically improve the machine-translated text, leading to a significant boost in ASR performance. The paper closes with an in-depth analysis of why and how the machine-translated text improves the performance of the domain-specific ASR system. As by-products of this core domain-adaptation methodology, this paper also presents the first large-vocabulary continuous speech recognition system for Romanian, and introduces a diacritics restoration module to process the Romanian text corpora, as well as an automatic phonetization module needed to extend the Romanian pronunciation dictionary.

Related Topics
Physical Sciences and Engineering > Computer Science > Signal Processing