Article ID: 6902137
Journal: Procedia Computer Science
Published Year: 2017
Pages: 6 Pages
File Type: PDF
Abstract
Twitter is considered one of the most popular social networking platforms and has become a valuable information source for many Natural Language Processing (NLP) applications. Several strategies and linguistic pipelines have been developed for analyzing English tweets, but Arabic social media analysis remains an active research area. In this paper, we focus on the task of pre-processing Arabic tweets, which can be regarded as a first step for any NLP application. We then apply statistical machine translation to translate Arabic tweets into English, and we explain the normalization process for both Arabic and English tweets. Moreover, to overcome the lack of Arabic-English parallel corpora in the social media context, we used the UN corpus, a more general corpus in Modern Standard Arabic and English. We then applied adaptation strategies for tweet content, such as using an out-of-domain and/or in-domain language model. Our experiments showed that applying good lexical normalization to both languages and combining in-domain and out-of-domain data for the language model improves the BLEU score by 4 points over the baseline.
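To make the idea of lexical normalization concrete, the following is a minimal Python sketch of the kind of tweet clean-up commonly applied before translation. The abstract does not list the paper's actual normalization rules, so every rule here (removing URLs and mentions, unwrapping hashtags, stripping Arabic diacritics and tatweel, collapsing elongated characters) and the function name `normalize_tweet` are assumptions chosen for illustration, not the authors' pipeline.

```python
import re

# All rules below are illustrative assumptions; the abstract does not
# specify the normalization steps actually used in the paper.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Arabic tanwin, short vowels, shadda, and sukun (U+064B to U+0652).
DIACRITICS_RE = re.compile(r"[\u064B-\u0652]")
TATWEEL = "\u0640"  # Arabic elongation character (kashida)
# Any character repeated three or more times in a row.
ELONGATION_RE = re.compile(r"(.)\1{2,}")


def normalize_tweet(text: str) -> str:
    """Apply a few simple lexical normalization rules to one tweet."""
    text = URL_RE.sub(" ", text)             # drop links
    text = MENTION_RE.sub(" ", text)         # drop @user mentions
    text = text.replace("#", "")             # keep hashtag words, drop '#'
    text = DIACRITICS_RE.sub("", text)       # strip Arabic diacritics
    text = text.replace(TATWEEL, "")         # strip tatweel
    text = ELONGATION_RE.sub(r"\1", text)    # collapse runs of 3+ to one char
    return " ".join(text.split())            # normalize whitespace
```

Because `re` patterns on Python strings are Unicode-aware, the same function can be applied to both Arabic and English tweets. Collapsing repeated characters to a single one is a deliberately crude choice: it also shortens legitimate runs of three or more identical letters, so a real system would likely check candidates against a lexicon.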
Related Topics
Physical Sciences and Engineering > Computer Science > Computer Science (General)