Article ID: 515833
Journal: Information Processing & Management
Published Year: 2014
Pages: 13 Pages
File Type: PDF
Abstract

• To observe the features of Malay Tweets, three distinct corpus-based analyses are carried out.
• A rule-based architecture is developed based on the results of these analyses.
• The architecture consists of seven distinct modules in a pipeline structure.
• Experimental results indicate high accuracy in terms of BLEU score.
• The architecture outperforms an SMT-like normalization approach.

Research in natural language processing has increasingly focused on normalizing Twitter messages. While several well-defined approaches have been proposed for English, the problem remains far from solved for other languages, such as Malay. In this paper, we propose an approach to normalizing Malay Twitter messages based on corpus-driven analysis. We present an architecture for Malay Tweet normalization that comprises seven main modules: (1) enhanced tokenization, (2) In-Vocabulary (IV) detection, (3) specialized dictionary query, (4) repeated letter elimination, (5) abbreviation adjusting, (6) English word translation, and (7) de-tokenization. A parallel Tweet dataset consisting of 9000 Malay Tweets is used in the development and testing stages. In the evaluation, the system achieves a BLEU score of 0.83, against a baseline BLEU score of 0.46. To compare the accuracy of the architecture with a statistical approach, an SMT-like normalization system is implemented, trained, and evaluated on the identical parallel dataset. The experimental results demonstrate that the normalization system designed around the features of Malay Tweets achieves higher accuracy than the SMT-like system.
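
The seven-module pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the word lists, the regular expressions, the function names, and the rule that a token stops at the first module that resolves it are all assumptions made for illustration, since the abstract gives only the module names and their order.

```python
import re

# Hypothetical toy resources; the paper's actual lexicons are not given in the abstract.
IV_WORDS = {"saya", "tidak", "suka", "makan", "yang", "sangat", "sedap", "terima", "kasih"}
SLANG_DICT = {"sy": "saya", "x": "tidak", "tq": "terima kasih"}   # specialized dictionary
ABBREVIATIONS = {"mkn": "makan", "yg": "yang"}                    # vowel-dropped forms
EN_TO_MS = {"very": "sangat"}                                     # English-to-Malay lookup

# (1) Tokenization that keeps mentions, hashtags and URLs as single tokens.
TOKEN_RE = re.compile(r"@\w+|#\w+|https?://\S+|\w+|[^\w\s]")


def tokenize(text):
    return TOKEN_RE.findall(text)


def is_protected(token):
    """Mentions, hashtags, URLs and punctuation are passed through unchanged."""
    return token.startswith(("http://", "https://")) or not token[0].isalnum()


def lookup(word):
    """(2) IV detection, then (3) specialized dictionary query."""
    if word in IV_WORDS:
        return word
    return SLANG_DICT.get(word)


def squeeze_candidates(word):
    """(4) Repeated letter elimination: runs of 3+ become 2, then all runs become 1."""
    return [re.sub(r"(.)\1{2,}", r"\1\1", word), re.sub(r"(.)\1+", r"\1", word)]


def normalize_token(token):
    if is_protected(token):
        return token
    word = token.lower()
    resolved = lookup(word)
    if resolved is not None:
        return resolved
    for candidate in squeeze_candidates(word):
        resolved = lookup(candidate)
        if resolved is not None:
            return resolved
    if word in ABBREVIATIONS:      # (5) abbreviation adjusting
        return ABBREVIATIONS[word]
    if word in EN_TO_MS:           # (6) English word translation
        return EN_TO_MS[word]
    return token                   # unresolved tokens are left as they are


def detokenize(tokens):
    """(7) De-tokenization: join tokens and reattach trailing punctuation."""
    return re.sub(r"\s+([.,!?;:])", r"\1", " ".join(tokens))


def normalize(tweet):
    return detokenize([normalize_token(t) for t in tokenize(tweet)])
```

With these toy resources, `normalize("sy x suka mkn yg sedaaap very!")` returns `"saya tidak suka makan yang sedap sangat!"`. Running the cheap lookups before the rewriting steps is a design choice of this sketch; how the actual system orders and combines its rules within each module is not stated in the abstract.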

Related Topics
Physical Sciences and Engineering > Computer Science > Computer Science Applications