Using chi-square statistics to measure similarities for text categorization

Article code	Journal code	Publication year	English article
385962	660876	2011	6-page PDF


Keywords

Nonparametric statistics; Text mining; Machine learning

Related subjects

Engineering and basic sciences, computer engineering, artificial intelligence


Abstract

In this paper, we propose using chi-square statistics to measure similarities and chi-square tests to determine the homogeneity of two random samples of term vectors for text categorization. The properties of chi-square tests for text categorization are studied first. One advantage of the chi-square test is that its significance level is similar to the miss rate, which provides a foundation for a theoretical performance (i.e., miss rate) guarantee. Generally, a classifier using cosine similarities with TF*IDF performs reasonably well in text categorization. However, its performance may fluctuate even near the optimal threshold value. To address this limitation, we propose the combined usage of chi-square statistics and cosine similarities. Extensive experimental results verify the properties of chi-square tests and the performance of the combined usage.
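As a rough illustration of the homogeneity test mentioned in the abstract, the sketch below computes the standard Pearson chi-square statistic for a 2 x k contingency table built from two raw term-frequency vectors. It is a textbook formulation, not necessarily the exact procedure of the paper; the function name `chi_square_homogeneity` and the example vectors are hypothetical.

```python
def chi_square_homogeneity(tf_a, tf_b):
    """Pearson chi-square statistic for two term-frequency vectors.

    Treats the vectors as the two rows of a 2 x k contingency table
    (assumption: both are raw counts over the same vocabulary).
    Returns (statistic, degrees_of_freedom).
    """
    if len(tf_a) != len(tf_b):
        raise ValueError("vectors must cover the same vocabulary")
    n_a, n_b = sum(tf_a), sum(tf_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("each vector needs at least one term occurrence")
    total = n_a + n_b

    stat = 0.0
    used_terms = 0
    for a, b in zip(tf_a, tf_b):
        col = a + b
        if col == 0:
            continue  # term absent from both samples: contributes nothing
        # Expected counts if both samples share one term distribution.
        exp_a = n_a * col / total
        exp_b = n_b * col / total
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
        used_terms += 1
    return stat, used_terms - 1


# Hypothetical counts: one document against an aggregated category profile.
doc_tf = [12, 7, 0, 3]
category_tf = [120, 80, 5, 25]
stat, dof = chi_square_homogeneity(doc_tf, category_tf)
print(f"chi-square = {stat:.3f} with {dof} degrees of freedom")
```

A smaller statistic indicates that the two vectors are more alike, so the statistic behaves as a dissimilarity. Comparing it with the chi-square critical value for the chosen significance level turns it into an accept or reject decision.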

Research highlights
► For a text categorization task, chi-square statistics can be used to measure dissimilarities, and chi-square tests can be used as classifiers.
► The significance level of a chi-square test and the miss rate for the corresponding text categorization task are completely positively correlated.
► A chi-square test can determine the homogeneity of two random samples of original TF vectors without difficulty.
► A classifier using both cosine similarities with TF*IDF and chi-square statistics as its similarity measures performs on par with or better than one using only cosine similarity with TF*IDF in F1.
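The link between significance level and miss rate in the highlights can be illustrated with a small simulation. The sketch below is an idealized model, not the paper's experiment: it assumes documents of one category are multinomial samples from a single term distribution, and it counts how often a chi-square test at the 0.05 level wrongly rejects two such samples as non-homogeneous. All names, the term probabilities, and the sample sizes are hypothetical; the critical values are the standard 0.05 table entries for 1 to 3 degrees of freedom.

```python
import random

# Standard chi-square critical values at significance level 0.05.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square_stat(tf_a, tf_b):
    """Pearson chi-square statistic and degrees of freedom for two count vectors."""
    n_a, n_b = sum(tf_a), sum(tf_b)
    total = n_a + n_b
    stat, used = 0.0, 0
    for a, b in zip(tf_a, tf_b):
        col = a + b
        if col == 0:
            continue  # term absent from both samples
        exp_a, exp_b = n_a * col / total, n_b * col / total
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
        used += 1
    return stat, used - 1


def sample_tf(probs, n_tokens, rng):
    """Draw a term-frequency vector of n_tokens tokens from a term distribution."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n_tokens):
        counts[idx] += 1
    return counts


def simulated_miss_rate(probs, n_tokens, trials, seed=0):
    """Fraction of same-category pairs that the 0.05-level test rejects."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        stat, dof = chi_square_stat(sample_tf(probs, n_tokens, rng),
                                    sample_tf(probs, n_tokens, rng))
        if dof in CRITICAL_05 and stat > CRITICAL_05[dof]:
            misses += 1  # a true member of the category was rejected
    return misses / trials


rate = simulated_miss_rate([0.4, 0.3, 0.2, 0.1], n_tokens=200, trials=2000)
print(f"simulated miss rate at alpha = 0.05: {rate:.3f}")
```

Under these assumptions the simulated miss rate should land close to the chosen significance level, which is the sense in which the significance level can serve as a performance guarantee.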

Publisher

Database: Elsevier - ScienceDirect
Journal: Expert Systems with Applications - Volume 38, Issue 4, April 2011, Pages 3085–3090

Authors

Yao-Tsung Chen, Meng Chang Chen
