Article code  Journal code  Publication year  English article  Full-text version
4970097  1450026  2017  16-page PDF  Free download
English title of the ISI article
Improved TFIDF in big news retrieval: An empirical study
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Vision and Pattern Recognition
English abstract
Thomson Reuters news articles are an integral data source that has given rise to several inspiring applications of text classification and clustering. The best-known term weighting approach, term frequency-inverse document frequency (TFIDF), is often used to assign the term weights that support such applications. Thomson Reuters reports pertinent incoming news (e.g., the refugee crisis in Europe) over a given period of time, so the most prominent terms (e.g., "refugee") occur frequently across a large collection of news stories. When term weights are measured with TFIDF, these weights are heavily compromised once the news collection is sufficiently large. Because TFIDF treats the most important terms as noise and therefore assigns them lower weights, it is vulnerable to bias, and news retrieval without the most important terms is difficult and ineffective. We thus present a new distance-based term weighting method that overcomes this bias by exploiting a basic characteristic of big news collections: each news article is either similar to or different from the others, so not all news articles should be considered to contribute equally to the weight of a particular term. In this study, the weight of a particular term is assessed based on its distance in an article to other instances of the same term, and this weight is highly sensitive to whether similar articles cause a term to occur and to whether different articles cause a term to disappear. The most important terms are thus delivered in large news corpora when similarities between news stories are studied. In addition, we create a two-stage learning algorithm to refine the term weights, and we develop an intelligent model that applies our term weighting method to Reuters news analyses based on classification and clustering problems. The experimental results show that our methods perform better than TFIDF in news classification and clustering.
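The bias the abstract attributes to TFIDF can be illustrated with a minimal sketch. It assumes the textbook weighting tf × log(N / df) and a hypothetical four-document toy corpus; it is not the paper's distance-based method, whose details the abstract does not give. The function name `tfidf` and the sample documents are illustrative only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy corpus: every story mentions "refugee".
docs = [
    "refugee crisis europe border",
    "refugee camp aid europe",
    "refugee policy vote parliament",
    "refugee quota summit",
]
weights = tfidf(docs)

# Print the terms of the first story, highest weight first.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In this toy corpus "refugee" appears in every story, so its inverse document frequency is log(4/4) = 0 and its weight vanishes, while incidental terms such as "border" keep a positive weight. This is the kind of down-weighting of a prominent term that the abstract describes.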
Publisher
Database: Elsevier - ScienceDirect
Journal: Pattern Recognition Letters - Volume 93, 1 July 2017, Pages 113-122