Article code: 534077
Journal code: 870216
Publication year: 2012
English article: 8 pages, PDF
Full-text version: free download
English title of the ISI article
A high performance centroid-based classification approach for language identification
Related topics
Engineering and Basic Sciences > Computer Engineering > Computer Vision and Pattern Recognition
English abstract

Centroid-based classification is a machine learning approach used in the text classification domain. The main advantage of centroid-based classifiers is their high performance during both the training stage and the classification stage. However, their success rate can be lower than that of other classifiers if good centroid values are not used. In this paper, we apply the centroid-based classification method to the language identification problem, which can be considered a sub-problem of text classification. We propose a novel method, named inverse class frequency, that updates the classical centroid values to increase their quality. We also use a feature set formed of individual characters rather than words or n-gram sequences to decrease the training and classification times. The experiments were performed on the ECI/MCI corpus, and the method was compared with other methods and previous studies. The results showed that the proposed approach yields high success rates and works very efficiently for language identification.


► High performance language identification is still an open problem.
► One solution for high performance identification is centroid-based classification.
► We use a small feature set and a centroid-based classifier in this work.
► The results obtained outperform other classical methods.
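
The approach summarized in the abstract and highlights can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the inverse class frequency formula, so the weight used here, log(1 + number of classes / number of classes containing the character), is an assumption made by analogy with inverse document frequency. The function names (`char_profile`, `train_centroids`, `identify`) and the use of cosine similarity for scoring are also illustrative choices.

```python
import math
from collections import Counter


def char_profile(text):
    """Relative frequency of each individual character in `text`."""
    counts = Counter(text.lower())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def train_centroids(samples):
    """Build one weighted, unit-length centroid per language.

    `samples` maps a language label to a list of training strings.
    """
    # Classical centroid: the mean character profile of each class.
    centroids = {}
    for lang, texts in samples.items():
        acc = Counter()
        for t in texts:
            for ch, f in char_profile(t).items():
                acc[ch] += f
        centroids[lang] = {ch: v / max(len(texts), 1) for ch, v in acc.items()}

    # ASSUMPTION: inverse class frequency taken as log(1 + C / cf), where C is
    # the number of classes and cf the number of classes whose centroid
    # contains the character. The paper's exact formula may differ.
    n_classes = len(centroids)
    class_freq = Counter(ch for c in centroids.values() for ch in c)
    icf = {ch: math.log(1 + n_classes / cf) for ch, cf in class_freq.items()}

    # Update the classical values with the weight, then length-normalise.
    weighted = {}
    for lang, c in centroids.items():
        w = {ch: v * icf[ch] for ch, v in c.items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weighted[lang] = {ch: v / norm for ch, v in w.items()}
    return weighted


def identify(text, centroids):
    """Return the label of the centroid most similar to `text`."""
    profile = char_profile(text)

    # Dot product with a unit-length centroid; the query norm is omitted
    # because it is the same for every class and does not change the ranking.
    def score(centroid):
        return sum(v * centroid.get(ch, 0.0) for ch, v in profile.items())

    return max(centroids, key=lambda lang: score(centroids[lang]))
```

Training amounts to one pass over the labelled texts, and classification costs one sparse dot product per language over a feature set no larger than the alphabet, which is the efficiency argument the abstract makes for character features. A call such as `identify("some text", train_centroids({"en": [...], "tr": [...]}))` returns the best-matching label.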

Publisher
Database: Elsevier - ScienceDirect
Journal: Pattern Recognition Letters - Volume 33, Issue 16, 1 December 2012, Pages 2077–2084
Authors
, ,