Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
567033 | 1452042 | 2014 | 18-page PDF | Free download |
![First page of the article: SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian](/preview/png/567033.png)
• The first large-vocabulary Romanian ASR system is presented.
• Phonetization and diacritics restoration systems for Romanian are introduced.
• An innovative ASR domain-adaptation methodology based on SMT is proposed.
• The semi-supervised adaptation methods are shown to improve ASR performance.
This study investigates the possibility of using statistical machine translation (SMT) to create domain-specific language resources. We propose a methodology for building a domain-specific automatic speech recognition (ASR) system for a low-resourced language when in-domain text corpora are available only in a high-resourced language. Several translation scenarios (both unsupervised and semi-supervised) are used to obtain domain-specific textual data. Moreover, this paper shows that a small amount of manually post-edited text is enough to develop other natural language processing systems that, in turn, can be used to automatically improve the machine-translated text, leading to a significant boost in ASR performance. The paper closes with an in-depth analysis of why and how the machine-translated text improves the performance of the domain-specific ASR system. As by-products of this core domain-adaptation methodology, this paper also presents the first large-vocabulary continuous speech recognition system for Romanian and introduces a diacritics restoration module to process the Romanian text corpora, as well as an automatic phonetization module needed to extend the Romanian pronunciation dictionary.
Journal: Speech Communication - Volume 56, January 2014, Pages 195–212