Word co-occurrence features for text classification

Article code	Journal code	Publication year	English article	Full-text version
397019	670661	2011	16-page PDF	Free download

Keywords

Feature extraction; Classification; Text mining

Related topics

Engineering and basic sciences, Computer engineering, Artificial intelligence

Abstract

In this article we propose a data treatment strategy to generate new discriminative features, called compound-features (or c-features), for text classification. These c-features are composed of terms that co-occur in documents, without any restriction on the order of or distance between the terms within a document. The strategy precedes the classification task, in order to enhance documents with discriminative c-features. The idea is that, when c-features are used in conjunction with single-features, the ambiguity and noise inherent to the bag-of-words representation are reduced. We use c-features composed of two terms in order to make their usage computationally feasible while improving classifier effectiveness. We test this approach with several classification algorithms and single-label multi-class text collections. Experimental results show gains in almost all evaluated scenarios, from the simplest algorithms such as kNN (13% gain in micro-average F1 in the 20 Newsgroups collection) to the most complex one, the state-of-the-art SVM (10% gain in macro-average F1 in the OHSUMED collection).
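As a rough illustration of the representation described in the abstract, the sketch below adds one two-term c-feature for every unordered pair of distinct terms in a document, ignoring order and distance. The function name `add_c_features` and the `+` separator are assumptions made for this example. The abstract does not describe how the discriminative pairs are selected, so no selection step is shown here and every pair is kept.

```python
from itertools import combinations


def add_c_features(tokens):
    """Return the document's unique terms plus one feature per unordered term pair.

    Hypothetical helper: it keeps every pair, whereas the paper keeps only
    discriminative ones (the selection criterion is not given in the abstract).
    """
    # Sorting the unique terms gives each pair a canonical form,
    # so ("new", "apple") and ("apple", "new") map to the same feature.
    terms = sorted(set(tokens))
    pairs = [f"{a}+{b}" for a, b in combinations(terms, 2)]
    return terms + pairs


doc = ["apple", "released", "new", "apple"]
features = add_c_features(doc)
print(features)
```

Keeping every pair grows quadratically with the number of distinct terms in a document, which is consistent with the abstract's remark that restricting c-features to two terms keeps the approach computationally feasible.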

► We propose new features for text classification, called c-features.
► c-Features are derived from single terms that co-occur in documents.
► Experiments using c-features and single terms yielded important gains.
► 13% gain in micro-average F1 in the 20 Newsgroups collection with the kNN method.
► 10% gain in macro-average F1 in the collection OHSUMED using SVM, among other gains.
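The gains above are reported in two different averages of F1. As a minimal sketch of the standard definitions (not the authors' evaluation code), the example below computes both for a single-label task: micro-average F1 pools the counts over all classes, while macro-average F1 averages the per-class F1 scores so that rare classes weigh as much as frequent ones. The function names and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro-average F1, macro-average F1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data (hypothetical): the classifier always predicts the majority class "a".
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a"]
micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)
```

On this toy data the micro average stays high because most documents belong to the majority class, while the macro average drops because the minority class is never predicted. This is why the two figures are usually reported separately for skewed collections.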

Publisher

Database: Elsevier - ScienceDirect
Journal: Information Systems - Volume 36, Issue 5, July 2011, Pages 843–858

Authors

Fábio Figueiredo, Leonardo Rocha, Thierson Couto, Thiago Salles, Marcos André Gonçalves, Wagner Meira Jr.
