Improved TFIDF in big news retrieval: An empirical study

Article code	Journal code	Publication year	English article	Full-text version
4970097	1450026	2017	16-page PDF	Free download

Keywords

Term weighting

Related topics

Engineering and Basic Sciences, Computer Engineering, Computer Vision and Pattern Recognition

Abstract

Thomson Reuters news articles are an integral data source that has given rise to several inspiring applications of text classification and clustering. The best-known term weighting approach, the term frequency-inverse document frequency (TFIDF) method, is often used to assign the term weights that support such applications. Thomson Reuters reports pertinent incoming news (e.g., the refugee crisis in Europe) over a given period of time, so the most prominent terms (e.g., “refugee”) occur frequently across a large collection of news stories. When term weights are measured via the TFIDF method, these weights are heavily compromised once the news collection is sufficiently large. TFIDF is vulnerable to this bias because the most important terms are treated as noise and thus receive lower weights, and news retrieval without the most important terms is difficult and ineffective. We therefore present a new distance-based term weighting method that overcomes this bias by exploiting a basic characteristic of big news collections: each news article is either similar to or different from the others. Not all news articles should be considered to contribute equally to the weighting of a particular term. In this study, the weight of a particular term is assessed based on its distance in an article to other instances of the same term, and this weight is highly sensitive to whether similar articles cause a term to occur and to whether different articles cause a term to disappear. The most important terms are thus recovered in large news corpora by studying similarities between news stories. In addition, we create a two-stage learning algorithm to refine the term weights, and we develop an intelligent model that applies our term weighting method to Reuters news analyses based on classification and clustering problems. The experimental results show that our methods perform better than TFIDF in news classification and clustering.
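
The bias described in the abstract can be illustrated with a minimal Python sketch. It assumes the textbook formulation tf × log(N / df); the exact TFIDF variant used in the paper is not stated here, and the distance-based method itself is not reproduced. The function name `tfidf` and the four-document toy corpus are illustrative assumptions, not material from the paper.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Assumption: plain whitespace tokenization and the unsmoothed textbook
    IDF; real systems often use smoothed or sublinear variants.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Hypothetical toy corpus: "refugee" is the prominent term in every story.
docs = [
    "refugee crisis europe border",
    "refugee camp aid europe",
    "refugee policy vote parliament",
    "refugee boat rescue coast",
]
w = tfidf(docs)
for term in ("refugee", "europe", "border"):
    print(term, round(w[0][term], 3))
```

Because "refugee" occurs in every document, its inverse document frequency is log(4/4) = 0, so the term that best characterizes the collection receives no weight at all, while a rarer term such as "border" is ranked highest.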

Publisher

Database: Elsevier - ScienceDirect
Journal: Pattern Recognition Letters - Volume 93, 1 July 2017, Pages 113-122

Authors

Chien-Hsing Chen
