Article code: 515808
Journal code: 867098
Publication year: 2016
English article: 14-page PDF
Full-text version: Free download
English title of the ISI article
DeASCIIfication approach to handle diacritics in Turkish information retrieval
Keywords
Accents; DeASCIIfier; Diacritics restoration; Risk-sensitive evaluation; Fall; Turkish information retrieval
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Science Software
English abstract


• Risk-sensitive evaluation of approaches for handling diacritics in Turkish information retrieval.
• Application of diacritics restoration to Turkish information retrieval.
• Investigation of the diacritics sensitivity of stemming algorithms.

The absence of diacritics in text documents or search queries is a serious problem for Turkish information retrieval because it creates homographic ambiguity. Thus, inappropriate handling of diacritics reduces retrieval performance in search engines. A straightforward solution to this problem is to normalize tokens by replacing diacritic characters with their American Standard Code for Information Interchange (ASCII) counterparts. However, this so-called ASCIIfication produces either synthetic words that are not legitimate Turkish words or legitimate words whose meanings are completely different from those of the original words. These invalid synthetic words cannot be processed by morphological analysis components (such as stemmers or lemmatizers), which expect the input to be valid Turkish words. By contrast, synthetic words are not a problem when no stemmer or a simple first-n-characters stemmer is used in the text analysis pipeline. This difference highlights the notion of the diacritic sensitivity of stemmers. In this study, we propose and evaluate an alternative solution based on deASCIIfication, which restores accented letters in query terms or text documents. Our risk-sensitive evaluation showed that the diacritics restoration approach yielded more effective and robust results than normalizing tokens to remove diacritics.
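
The ASCIIfication step and the first-n-characters stemmer described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `asciify` and `first_n_stem`, the default prefix length of 5, and the example words are assumptions chosen for illustration only.

```python
import unicodedata

# Turkish-specific letters mapped to their ASCII counterparts.
# The table covers the six letter pairs that differ from ASCII: ç, ğ, ı, ö, ş, ü.
TURKISH_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def asciify(token: str) -> str:
    """Replace Turkish diacritic letters with ASCII counterparts (ASCIIfication)."""
    # Compose characters first so that decomposed input (letter + combining mark)
    # is matched by the translation table.
    composed = unicodedata.normalize("NFC", token)
    return composed.translate(TURKISH_TO_ASCII)


def first_n_stem(token: str, n: int = 5) -> str:
    """Truncation stemmer: keep only the first n characters of the token."""
    return token[:n]


if __name__ == "__main__":
    # A legitimate word turns into a different legitimate word:
    # "öldü" (died) becomes "oldu" (became / happened).
    print(asciify("öldü"))

    # Two distinct words, "açı" (angle) and "acı" (pain), collapse into the
    # same synthetic token "aci".
    print(asciify("açı"), asciify("acı"))

    # A truncation stemmer needs no lexicon, so it accepts the synthetic token.
    print(first_n_stem(asciify("açıklama")))
```

The truncation stemmer operates on raw character strings, which is why synthetic tokens do not disturb it, whereas a lexicon-based stemmer or lemmatizer would have no entry for a token such as "aci".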

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Processing & Management - Volume 52, Issue 2, March 2016, Pages 326–339