Article code: 515833
Journal code: 867108
Publication year: 2014
English article: 13 pages, PDF
Full-text version: free download
English title of the ISI article
An architecture for Malay Tweet normalization
Related topics
Engineering and Basic Sciences; Computer Engineering; Computer Science Software
English abstract


• To observe features of Malay Tweets, three distinct corpus-based analyses are done.
• A rule-based architecture is developed based on results of the analyses.
• The architecture consists of seven distinct modules in a pipeline structure.
• Experimental results indicate high accuracy in terms of BLEU score.
• The architecture outperforms an SMT-like normalization approach.

Research in natural language processing has increasingly focused on normalizing Twitter messages. While several well-defined approaches have been proposed for English, the problem remains far from solved for other languages, such as Malay. In this paper, we therefore propose an approach to normalizing Malay Twitter messages based on corpus-driven analysis. We present an architecture for Malay Tweet normalization that comprises seven main modules: (1) enhanced tokenization, (2) In-Vocabulary (IV) detection, (3) specialized dictionary query, (4) repeated letter elimination, (5) abbreviation adjusting, (6) English word translation, and (7) de-tokenization. A parallel Tweet dataset consisting of 9000 Malay Tweets is used in the development and testing stages. In the evaluation, the system achieves a BLEU score of 0.83, against a baseline BLEU score of 0.46. To compare the accuracy of the architecture with statistical approaches, an SMT-like normalization system is implemented, trained, and evaluated on the identical parallel dataset. The experimental results show that the normalization system, which is designed around the features of Malay Tweets, achieves higher accuracy than the SMT-like system.
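
The seven-module pipeline named in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the actual rules, so the fall-through order between modules, the regular expressions, and every dictionary entry below (`MALAY_VOCAB`, `SPECIAL_DICT`, `ABBREVIATIONS`, `EN_TO_MS`) are illustrative assumptions chosen only to make the example run.

```python
import re

# Toy resources. Every entry is a hypothetical placeholder, not the paper's data.
MALAY_VOCAB = {"saya", "tidak", "suka", "makan", "sangat", "dengan", "terima", "kasih"}
SPECIAL_DICT = {"x": "tidak", "sy": "saya"}         # known non-standard forms
ABBREVIATIONS = {"dgn": "dengan", "mkn": "makan"}   # shortened spellings
EN_TO_MS = {"thanks": "terima kasih", "very": "sangat"}

# (1) Enhanced tokenization: URLs, @mentions and #hashtags stay single tokens.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|\w+|[^\w\s]")


def tokenize(text):
    return TOKEN_RE.findall(text)


def is_protected(token):
    """Mentions, hashtags, URLs and punctuation are passed through unchanged."""
    return token.startswith(("@", "#", "http://", "https://")) or not token[0].isalnum()


def collapse_repeats(word):
    """(4) Repeated letter elimination: 'sukaaaa' -> 'suka'."""
    return re.sub(r"([a-z])\1+", r"\1", word)


def normalize_token(token):
    if is_protected(token):
        return token
    word = token.lower()
    if word in MALAY_VOCAB:                 # (2) IV detection
        return token
    if word in SPECIAL_DICT:                # (3) specialized dictionary query
        return SPECIAL_DICT[word]
    collapsed = collapse_repeats(word)      # (4) repeated letter elimination
    if collapsed in MALAY_VOCAB:
        return collapsed
    for form in (word, collapsed):          # (5) abbreviation adjusting
        if form in ABBREVIATIONS:
            return ABBREVIATIONS[form]
    for form in (word, collapsed):          # (6) English word translation
        if form in EN_TO_MS:
            return EN_TO_MS[form]
    return token                            # unknown token: leave as-is


def detokenize(tokens):
    """(7) De-tokenization: join tokens and remove spaces before punctuation."""
    return re.sub(r"\s+([.,!?;:])", r"\1", " ".join(tokens))


def normalize(text):
    return detokenize([normalize_token(t) for t in tokenize(text)])


print(normalize("Sy x sukaaaa mkn dgn @ali, thanks!!!"))
# -> saya tidak suka makan dengan @ali, terima kasih!!!
```

In this sketch each token falls through the modules in the order listed in the abstract and stops at the first one that produces a match. A rule-based design of this kind depends heavily on the coverage of its dictionaries, which is presumably why the authors derive theirs from corpus analyses.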

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Processing & Management - Volume 50, Issue 5, September 2014, Pages 621–633