Article code: 558222
Journal code: 1451691
Publication year: 2016
English article: 17-page PDF
Full-text version: free download
English title of the ISI article
Getting more from automatic transcripts for semi-supervised language modeling
Related topics
Engineering and Basic Sciences > Computer Engineering > Signal Processing
Abstract


• We analyze why semi-supervised backoff language modeling performs poorly.
• We motivate MAP adaptation of a log-linear language model.
• We use automatic transcripts as a prior for language model estimation.
• We show consistent reduction in WER across a range of low-resource conditions.

Many under-resourced languages such as Arabic diglossia or Hindi sub-dialects do not have sufficient in-domain text to build strong language models for use with automatic speech recognition (ASR). Semi-supervised language modeling uses a speech-to-text system to produce automatic transcripts from a large amount of in-domain audio typically to augment a small amount of manual transcripts. In contrast to the success of semi-supervised acoustic modeling, conventional language modeling techniques have provided only modest gains. This paper first explains the limitations of back-off language models due to their dependence on long-span n-grams, which are difficult to accurately estimate from automatic transcripts. From this analysis, we motivate a more robust use of the automatic counts as a prior over the estimated parameters of a log-linear language model. We demonstrate consistent gains for semi-supervised language models across a range of low-resource conditions.

Publisher
Database: Elsevier - ScienceDirect
Journal: Computer Speech & Language - Volume 36, March 2016, Pages 93–109