| Article ID | Journal | Published Year | Pages | File Type |
|---|---|---|---|---|
| 6902137 | Procedia Computer Science | 2017 | 6 Pages | |
Abstract
Twitter is considered one of the most popular social networking platforms. It has become a very valuable information source for many Natural Language Processing (NLP) applications. Several strategies and linguistic pipelines have been developed for analyzing English tweets, but Arabic social media analysis is still an active research area. In this research paper, we focus on the task of pre-processing Arabic tweets, which can be regarded as a first step for any NLP application. We follow up with a statistical machine translation of Arabic tweets into English, and explain the normalization process for both Arabic and English tweets. Moreover, to overcome the unavailability of Arabic-English parallel corpora in the social media context, we used the UN corpus, a more general corpus in Modern Standard Arabic and English. Then, we applied adaptation strategies to the tweets' content, such as using an out-of-domain and/or in-domain language model. Our experiments showed that applying a good lexical normalization on both languages and combining in-domain and out-of-domain data for the language model improves the BLEU score by 4 points over the baseline.
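The abstract refers to lexical normalization of Arabic tweets as the first pre-processing step. The sketch below is a minimal illustration of what such a step typically involves (removing URLs, mentions, and diacritics, unifying alef/yaa/taa marbuta variants, collapsing character elongation); it is an assumption about common practice, not the authors' exact pipeline, and the function name is hypothetical.

```python
import re

# Illustrative Arabic tweet normalization (assumed typical steps, not the paper's exact procedure).
ARABIC_DIACRITICS = re.compile(r'[\u0617-\u061A\u064B-\u0652]')  # tashkeel marks

def normalize_arabic_tweet(text: str) -> str:
    # Remove Twitter-specific tokens: URLs and user mentions; keep hashtag words, drop the '#'
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = text.replace('#', ' ')
    # Strip diacritics (tashkeel) and the tatweel elongation character
    text = ARABIC_DIACRITICS.sub('', text)
    text = text.replace('\u0640', '')  # tatweel
    # Normalize common orthographic variants
    text = re.sub('[\u0622\u0623\u0625]', '\u0627', text)  # alef variants -> bare alef
    text = text.replace('\u0649', '\u064A')                # alef maqsura -> yaa
    text = text.replace('\u0629', '\u0647')                # taa marbuta -> haa
    # Collapse repeated characters (e.g., elongated expressions) and extra whitespace
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    return re.sub(r'\s+', ' ', text).strip()

if __name__ == '__main__':
    # Example: an elongated word, a hashtag, a mention, and a URL
    print(normalize_arabic_tweet('مرحباااا بالعالم #تجربة @user http://t.co/xyz'))
```

A similar cleanup pass is usually applied on the English side (lowercasing, URL and mention removal) before feeding sentence pairs to the statistical MT system.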
Related Topics
Physical Sciences and Engineering
Computer Science
Computer Science (General)
Authors
Fatma Mallek, Billal Belainine, Fatiha Sadat