Article code: 515580
Journal code: 867046
Publication year: 2008
English article: 13-page PDF
Full-text version: free download
English title of the ISI article
Global term weights in distributed environments
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Science Software
English abstract

This paper examines the estimation of global term weights (such as IDF) in information retrieval scenarios where a global view on the collection is not available. In particular, the two options of either sampling documents or of using a reference corpus independent of the target retrieval collection are compared using standard IR test collections. In addition, the possibility of pruning term lists based on frequency is evaluated. The results show that very good retrieval performance can be reached when just the most frequent terms of a collection – an “extended stop word list” – are known and all terms which are not in that list are treated equally. However, the list cannot always be fully estimated from a general-purpose reference corpus, but some “domain-specific stop words” need to be added. A good solution for achieving this is to mix estimates from small samples of the target retrieval collection with ones derived from a reference corpus.
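
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not state how the estimates are mixed or which weight the unlisted terms receive. The sketch assumes a linear interpolation of relative document frequencies with a hypothetical weight `lam`, a hypothetical list size `keep_top`, whitespace tokenization, and an arbitrary choice of giving every unlisted term the largest IDF found in the list. All function names are illustrative.

```python
import math
from collections import Counter


def doc_freq_ratio(docs):
    """Fraction of documents in which each term occurs (whitespace tokens)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return {term: count / len(docs) for term, count in df.items()}


def mixed_pruned_idf(sample_docs, reference_docs, keep_top, lam=0.5):
    """Mix two document-frequency estimates and keep only the top terms.

    lam (interpolation weight) and keep_top (list size) are assumptions
    made for this sketch. Returns (idf_by_term, default_idf).
    """
    sample = doc_freq_ratio(sample_docs)
    reference = doc_freq_ratio(reference_docs)
    # Assumed mixing rule: linear interpolation of the two estimates.
    mixed = {
        term: lam * sample.get(term, 0.0) + (1 - lam) * reference.get(term, 0.0)
        for term in set(sample) | set(reference)
    }
    # "Extended stop word list": the keep_top most frequent terms.
    # Ties are broken alphabetically so the result is deterministic.
    top = sorted(mixed, key=lambda t: (-mixed[t], t))[:keep_top]
    idf = {term: -math.log(mixed[term]) for term in top}
    # Assumed default: every unlisted term shares the largest listed IDF.
    default_idf = max(idf.values()) if idf else 0.0
    return idf, default_idf


def term_weight(term, idf, default_idf):
    """Look up a term's weight; unlisted terms are all treated equally."""
    return idf.get(term.lower(), default_idf)


# Toy data: a small sample of a hypothetical target collection and a
# hypothetical general-purpose reference corpus.
sample = ["patient study results", "patient trial", "patient dose study"]
reference = ["the cat sat", "the dog ran", "the end"]
idf, default_idf = mixed_pruned_idf(sample, reference, keep_top=2)
```

In this toy run the list contains one general stop word ("the", contributed by the reference corpus) and one domain-specific frequent term ("patient", contributed by the sample), which is the kind of combination the abstract describes; every other term falls back to `default_idf`.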

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Processing & Management - Volume 44, Issue 3, May 2008, Pages 1049–1061