Article ID: 4944430
Journal ID: 1437990
Publication year: 2017
English article: 17-page PDF
Full-text version: free download
English title of the ISI article
Wikipedia-based cross-language text classification
Persian translation of the title
طبقه بندی متنی متقابل زبان مبتنی بر ویکیپدیا
Keywords
Cross-language text classification, Wikipedia Miner, bag of concepts, bag of words, hybrid, document representation
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
English abstract
This paper presents the application of a Wikipedia-based bag of concepts (WikiBoC) document representation to cross-language text classification (CLTC). Its main objective is to alleviate the major drawbacks of the state-of-the-art CLTC approaches, which are typically based on the machine translation (MT) of documents represented as bags of words (BoW). We propose a technique called cross-language concept matching (CLCM) to convert concept-based representations of documents from one language to another using Wikipedia correspondences between concepts in different languages, and thus without relying on automated full-text translations. We describe two proposals: the first uses the WikiBoC representation in conjunction with the CLCM technique (WikiBoC-CLCM) to classify documents written in a language L1 by using an SVM algorithm that was trained with documents written in another language L2; the second is a hybrid model for representing documents that combines WikiBoC-CLCM with the classic BoW-MT approach. To evaluate the two proposals we conducted several experiments with three cross-lingual corpora: the JRC-Acquis corpus and two purpose-built corpora composed of Wikipedia articles. The first proposal outperforms state-of-the-art approaches when training sequences are short, achieving performance increases of up to 233.33%. The second proposal outperforms state-of-the-art approaches over the whole range of training sequences, achieving performance increases of up to 23.78%. The results show the benefits of the WikiBoC-CLCM approach, since concepts extracted from documents add useful information to the classifier, thus improving its performance.
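The CLCM step and the hybrid representation described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the link table ES_TO_EN, the function names clcm and hybrid, the use of raw counts as weights, and the "c:" / "w:" key prefixes used to keep concept and word features apart. The paper's actual weighting and combination scheme may differ.

```python
from collections import Counter

# Hypothetical cross-language links (source concept -> target concept).
# The abstract says such correspondences come from Wikipedia; these
# entries are illustrative only.
ES_TO_EN = {
    "Aprendizaje automático": "Machine learning",
    "Clasificación de documentos": "Document classification",
}


def clcm(concept_bag, links):
    """Map a bag of concepts into the target language.

    Concepts without a cross-language link are dropped (an assumption).
    """
    mapped = Counter()
    for concept, weight in concept_bag.items():
        target = links.get(concept)
        if target is not None:
            mapped[target] += weight
    return mapped


def hybrid(concept_bag, word_bag):
    """Merge concept and word features into one feature dictionary.

    Keys are prefixed so that a concept and a word with the same
    surface form do not collide.
    """
    features = {f"c:{name}": w for name, w in concept_bag.items()}
    features.update({f"w:{name}": w for name, w in word_bag.items()})
    return features


# A Spanish document represented as a bag of concepts (raw counts).
doc_es = Counter({
    "Aprendizaje automático": 2,
    "Clasificación de documentos": 1,
    "Concepto sin enlace": 3,  # no link in the table, so it is dropped
})
doc_en = clcm(doc_es, ES_TO_EN)

# Stand-in for the bag of words of a machine-translated document.
translated_words = Counter({"machine": 2, "learning": 2, "document": 1})
features = hybrid(doc_en, translated_words)
```

A feature dictionary of this shape could then be vectorized and passed to any SVM implementation trained on documents in the target language; that step is left out here because the abstract does not specify it.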
Publisher
Database: Elsevier - ScienceDirect
Journal: Information Sciences - Volumes 406–407, September 2017, Pages 12-28