Article code: 534282
Journal code: 870244
Publication year: 2014
English article: 10 pages, PDF
Full-text version: free download
English title of the ISI article
t-Test feature selection approach based on term frequency for text categorization
Keywords
Feature selection, term frequency, Student's t-test, text categorization
Related topics
Engineering and Basic Sciences; Computer Engineering; Computer Vision and Pattern Recognition
English abstract


• We prove that the frequency of a term is approximately normally distributed.
• We model the diversity of the frequency of a term with the t-test.
• We verify our approach on two text corpora with three classifiers.
• Our approach is comparable to or even better than the state-of-the-art methods.

Feature selection techniques play an important role in text categorization (TC), especially for large-scale TC tasks. Many new and improved methods have been proposed, and most of them are based on document frequency, such as the well-known Chi-square statistic and information gain. These document-frequency-based methods, however, have two shortcomings: (1) they are not reliable for low-frequency terms, that is, low-frequency terms are filtered out because of their smaller weights; and (2) they only count whether a term occurs within a document and ignore term frequency. In fact, a high-frequency term (other than a stop word) that occurs in only a few documents is often regarded as a discriminator in real-life corpora.

To address these drawbacks, the paper focuses on how to construct a feature selection function based on term frequency and proposes a new approach using Student's t-test. The t-test function is used to measure the diversity of the distributions of a term's frequency between a specific category and the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that the proposed approach is comparable to state-of-the-art feature selection methods in terms of macro-F1 and micro-F1. On micro-F1 in particular, the method achieves slightly better performance on Reuters with the kNN and SVM classifiers than χ² and IG.
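
The idea described in the abstract, scoring a term by how much its frequency in one category differs from its frequency in the whole corpus, can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a Welch-style two-sample t statistic on relative term frequencies, which may differ from the exact scoring function defined in the paper, and the names `term_frequencies`, `t_score`, `select_features` and the toy documents are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def term_frequencies(docs, term):
    """Relative frequency of `term` in each tokenized document."""
    return [Counter(doc)[term] / len(doc) if doc else 0.0 for doc in docs]


def t_score(term, category_docs, corpus_docs):
    """Welch-style |t| between a term's frequency in a category and in the corpus.

    Assumption: an illustrative statistic, not necessarily the paper's formula.
    Each sample needs at least two documents for the sample variance.
    """
    in_cat = term_frequencies(category_docs, term)
    overall = term_frequencies(corpus_docs, term)
    # Standard error built from the two sample variances.
    se = sqrt(variance(in_cat) / len(in_cat) + variance(overall) / len(overall))
    if se == 0.0:
        return 0.0  # degenerate case: no variance in either sample
    return abs(mean(in_cat) - mean(overall)) / se


def select_features(vocabulary, category_docs, corpus_docs, k):
    """Return the k terms with the largest t score for the given category."""
    scores = {t: t_score(t, category_docs, corpus_docs) for t in vocabulary}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy corpus (hypothetical data): two "sports" and two "finance" documents.
sports = [["goal", "match", "goal", "team"], ["team", "goal", "win"]]
finance = [["stock", "market", "team"], ["market", "price", "stock", "stock"]]
corpus = sports + finance
vocabulary = sorted({word for doc in corpus for word in doc})

goal_score = t_score("goal", sports, corpus)
team_score = t_score("team", sports, corpus)
print(f"goal: {goal_score:.3f}, team: {team_score:.3f}")
```

In this toy corpus "goal" is concentrated in the sports documents while "team" also appears in a finance document, so "goal" receives the higher score. Because the sketch takes the absolute value of the difference, a term that is conspicuously absent from a category can also score highly for it.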

Publisher
Database: Elsevier - ScienceDirect
Journal: Pattern Recognition Letters - Volume 45, 1 August 2014, Pages 1–10