Arabic Social Media Analysis and Translation

Article ID	Journal	Published Year	Pages	File Type
6902137	Procedia Computer Science	2017	6 Pages	PDF

Abstract

Twitter is considered one of the most popular social networking platforms. It has become a valuable information source for many Natural Language Processing (NLP) applications. Several strategies and linguistic pipelines have been developed for analyzing English tweets, but Arabic social media analysis is still an active research area. In this paper, we focus on the task of pre-processing Arabic tweets, which can be regarded as a first step for any NLP application. We follow up with statistical machine translation of Arabic tweets into English, and explain the normalization process for both Arabic and English tweets. Moreover, to overcome the lack of Arabic-English parallel corpora in the social media context, we used the UN corpus, a more general corpus of Modern Standard Arabic and English. We then applied adaptation strategies for tweet content, such as using an out-of-domain and/or in-domain language model. Our experiments showed that applying good lexical normalization to both languages and combining in-domain and out-of-domain data for the language model improves the BLEU score by 4 points over the baseline.
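To make the idea of lexical normalization for Arabic tweets concrete, the following is a minimal sketch of what such a pre-processing step might look like. The abstract does not list the paper's actual normalization rules, so every step here (dropping URLs and mentions, stripping diacritics and tatweel, unifying alef variants, collapsing elongated letters) is an assumption drawn from common practice, and the function name `normalize_arabic_tweet` is hypothetical.

```python
import re

# All patterns below are illustrative assumptions, not the paper's rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Arabic short-vowel marks (U+064B..U+0652) and superscript alef (U+0670).
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # kashida, used to stretch words visually
# Alef with madda, hamza above, hamza below.
ALEF_VARIANTS_RE = re.compile(r"[\u0622\u0623\u0625]")
# An Arabic letter repeated three or more times in a row.
ELONGATION_RE = re.compile(r"([\u0621-\u064A])\1{2,}")


def normalize_arabic_tweet(text: str) -> str:
    """Apply a handful of common normalization steps to one tweet."""
    text = URL_RE.sub(" ", text)                # drop links
    text = MENTION_RE.sub(" ", text)            # drop @user mentions
    text = text.replace("#", " ").replace("_", " ")  # keep hashtag words
    text = DIACRITICS_RE.sub("", text)          # strip diacritics
    text = text.replace(TATWEEL, "")            # strip tatweel
    text = ALEF_VARIANTS_RE.sub("\u0627", text)  # unify to bare alef
    text = ELONGATION_RE.sub(r"\1", text)       # collapse letter elongation
    return " ".join(text.split())               # squeeze whitespace


if __name__ == "__main__":
    print(normalize_arabic_tweet("@user مرحبااااا http://t.co/abc #تويتر"))
```

Collapsing a run of three or more identical letters to a single letter is a deliberately simple choice; a real pipeline might consult a lexicon to decide whether one or two letters should remain.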

Keywords

machine translation; tweets; Twitter; Arabic