Article code: 558592
Journal code: 874953
Publication year: 2009
English article: 20 pages, PDF
Full-text version: Free download
English title of the ISI article
Arabic diacritic restoration approach based on maximum entropy models
Related topics
Engineering and Basic Sciences, Computer Engineering, Signal Processing
English abstract

In Modern Standard Arabic and in dialectal Arabic texts, short vowels and other diacritics are omitted. Exceptions are made for important political and religious texts and for scripts written for beginning students of Arabic. Scripts without diacritics are considerably ambiguous because many words with different diacritic patterns appear identical once the diacritics are removed. In this paper we present a maximum entropy approach for restoring short vowels and other diacritics in an Arabic document. The approach can easily integrate and make effective use of diverse types of information; the model we propose integrates a wide array of lexical, segment-based and part-of-speech tag features. The combination of these feature types leads to a high-performance diacritic restoration model. Using a publicly available corpus (LDC’s Arabic Treebank Part 3), we achieve a diacritic error rate of 5.1%, a segment error rate of 8.5%, and a word error rate of 17.3%. In a case-ending-less setting, we obtain a diacritic error rate of 2.2%, a segment error rate of 4.0%, and a word error rate of 7.2%. We also compare our approach to previously published techniques and demonstrate its effectiveness in restoring diacritics in different kinds of data, such as dialectal Iraqi Arabic scripts.
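
The abstract reports diacritic and word error rates without defining them. As a rough illustration only, the sketch below computes simplified versions of both, assuming the reference and predicted texts are already aligned word by word and share the same base letters. The function names, the choice of the Unicode range U+064B to U+0652 as "diacritics", and the per-letter counting are assumptions made for this example, not the paper's evaluation procedure; the segment error rate is left out because it requires a word segmentation that the abstract does not describe.

```python
# Arabic diacritic marks from fathatan to sukun (U+064B..U+0652).
# Assumption: this range is treated as "the diacritics" for this sketch.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_diacritics(word):
    """Return a list of (base_char, marks) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base letter
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
            # a mark with no preceding base letter is ignored
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference_words, predicted_words):
    """Return (diacritic_error_rate, word_error_rate) as fractions.

    Simplified definitions (assumptions of this sketch):
    - diacritic error rate: share of base letters whose marks differ;
    - word error rate: share of words with at least one such letter.
    """
    if not reference_words or len(reference_words) != len(predicted_words):
        raise ValueError("inputs must be non-empty and of equal length")

    letters = letter_errors = word_errors = 0
    for ref, hyp in zip(reference_words, predicted_words):
        ref_pairs = split_diacritics(ref)
        hyp_pairs = split_diacritics(hyp)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("base letters differ; texts are not aligned")
        # Compare marks as sorted lists so that mark order does not matter.
        wrong = sum(
            1
            for (_, r), (_, h) in zip(ref_pairs, hyp_pairs)
            if sorted(r) != sorted(h)
        )
        letters += len(ref_pairs)
        letter_errors += wrong
        word_errors += 1 if wrong else 0

    if letters == 0:
        raise ValueError("reference contains no base letters")
    return letter_errors / letters, word_errors / len(reference_words)
```

Under these definitions a single wrong mark counts once toward the diacritic error rate but makes the whole word wrong, which is one reason word-level figures such as those quoted above are much higher than the diacritic-level ones.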

Publisher
Database: Elsevier - ScienceDirect
Journal: Computer Speech & Language - Volume 23, Issue 3, July 2009, Pages 257–276