Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
378697 | 659205 | 2015 | 15-page PDF | Free download |
• We perform Arabic morphological disambiguation on unlabeled vocalized corpora.
• We experiment with possibilistic measures for classifying imprecise morphological data.
• We assess the impact of a reweighting model and the integration of linguistic rules.
• We propose and evaluate an approach for Out-of-Vocabulary word disambiguation.
Morphological ambiguity is an important phenomenon affecting several tasks in Arabic text analysis, indexing and mining, yet it has not been well studied in related work. In this paper, we investigate new approaches to disambiguating the morphological features of non-vocalized Arabic texts, combining statistical classification and linguistic rules. We perform unsupervised training on unlabeled vocalized Arabic corpora, so the training and testing sets contain imperfect instances (i.e., instances with ambiguous attributes and/or classes). To handle imperfect data, we compare two approaches: i) a possibilistic approach that handles imperfection directly; and ii) a data transformation-based approach that converts an imperfect dataset into a perfect one, so that classical classifiers can be applied. We also present an approach for dealing with unknown (Out-of-Vocabulary) words. The experiments focus mainly on classical texts, which have not been sufficiently studied in related work. We show that the possibilistic approach performs better than the transformation-based one. We also report encouraging results regarding i) the role of linguistic rules in improving disambiguation rates; and ii) the accuracy of our approach for full morphological disambiguation of unknown words.
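The abstract does not spell out how an imperfect dataset is converted into a perfect one. As a rough illustration only, the sketch below shows one common way such a transformation can be done: each instance with set-valued fields is expanded into every precise combination, with uniform weights. The function name `expand_imperfect`, the field names, and the uniform weighting are assumptions of this sketch, not details taken from the paper.

```python
from itertools import product


def expand_imperfect(instance):
    """Expand one imperfect instance into weighted precise instances.

    `instance` maps each field name to a non-empty list of candidate
    values; a field with a single candidate is already precise.
    """
    names = list(instance)
    # Every precise reading is one choice of value per field.
    combos = list(product(*(instance[name] for name in names)))
    # Uniform weight per reading: an assumption of this sketch.
    weight = 1.0 / len(combos)
    return [(dict(zip(names, combo)), weight) for combo in combos]


# Hypothetical instance: an ambiguous context feature and an ambiguous class.
ambiguous = {"prev_pos": ["NOUN", "VERB"], "pos": ["NOUN", "ADJ"]}
rows = expand_imperfect(ambiguous)
```

The resulting weighted rows could then be passed to any classical classifier that accepts per-sample weights, whereas a possibilistic classifier would work on the set-valued instance directly.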
Journal: Data & Knowledge Engineering - Volume 100, Part B, November 2015, Pages 240–254