Article code: 534280
Journal code: 870241
Publication year: 2014
English article: 8-page PDF
Full-text version: free download
English title of the ISI article
Statistical machine translation of subtitles for highly inflected language pair
Keywords
Statistical machine translation; phrase-based translation; highly inflected languages; bilingual dictionary; entropy
Related topics
Engineering and Basic Sciences > Computer Engineering > Computer Vision and Pattern Recognition
Abstract


• We present a study on phrase-based statistical machine translation between two closely related languages.
• Linguistic information improves translation quality when dealing with highly inflected languages.
• Integrating a dictionary is beneficial in the lemma translation component of a statistical machine translation (SMT) system.
• SMT produces useful translations in the subtitle domain.

This paper addresses the problem of statistical machine translation between highly inflected languages. Even when dealing with closely related language pairs, statistical machine translation encounters problems if the parallel corpus is not big enough. To reduce the problem of data sparsity, we use the approach called factored translation, which has proven successful when translating between English and a morphologically rich language. We show that it is even more useful when translating between two highly inflected languages. The main contribution of the paper involves two extensions of the factored translation approach. First, we propose a new, more general asynchronous framework for training translation components, where lemmas in the lemma component and morphosyntactic description (MSD) tags in the MSD component are aligned independently of the alignment done for surface word forms. The second contribution of the paper is a new technique for efficient use of a bilingual dictionary in the translation process. A dictionary is introduced into the lemma component to improve lexical translation. Dictionary use is based on entropy. We tested our enhanced translation approach on the Slovenian–Serbian language pair. The system was trained on a freely available OpenSubtitle corpus. The results show improvements in automatic scores (BLEU and TER). The approach could be used for other language pairs, especially if one or both are highly inflected.
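
The abstract does not spell out how entropy governs dictionary use. As a rough illustration only, the sketch below assumes one plausible reading: the Shannon entropy of a lemma's translation distribution measures how uncertain the lemma component is, and high-entropy lemmas are candidates for a dictionary lookup. The table contents, the `threshold` value, and the function names are all hypothetical and are not taken from the paper.

```python
import math

def translation_entropy(probs):
    """Shannon entropy (in bits) of one lemma's translation distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# Hypothetical lemma translation table: source lemma -> {target lemma: probability}.
# The entries and probabilities are invented for illustration.
lemma_table = {
    "hiša": {"kuća": 0.9, "dom": 0.1},
    "biti": {"biti": 0.25, "jesam": 0.25, "bivati": 0.25, "postojati": 0.25},
}

def uncertain_lemmas(table, threshold=1.0):
    """Return lemmas whose translation entropy exceeds the (assumed) threshold."""
    return [lemma for lemma, probs in table.items()
            if translation_entropy(probs) > threshold]

if __name__ == "__main__":
    for lemma, probs in lemma_table.items():
        print(lemma, round(translation_entropy(probs), 3))
    print(uncertain_lemmas(lemma_table))
```

Under this assumption, a lemma with one dominant translation has low entropy and is left to the statistical model, while a lemma whose probability mass is spread over many candidates is flagged for dictionary support.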

Publisher
Database: Elsevier - ScienceDirect
Journal: Pattern Recognition Letters - Volume 46, 1 September 2014, Pages 96–103