Article code: 515177
Journal code: 866964
Publication year: 2010
English article: 16 pages, PDF
Full-text version: free download
English title of the ISI article
Improving probabilistic information retrieval by modeling burstiness of words
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Science Software
Abstract

The classical probabilistic models attempt to capture the ad hoc information retrieval problem within a rigorous probabilistic framework. It has long been recognized that the primary obstacle to the effective performance of the probabilistic models is the need to estimate a relevance model. The Dirichlet compound multinomial (DCM) distribution based on the Polya Urn scheme, which can also be considered as a hierarchical Bayesian model, is a more appropriate generative model than the traditional multinomial distribution for text documents. We explore a new probabilistic model based on the DCM distribution, which enables efficient retrieval and accurate ranking. Because the DCM distribution captures the dependency of repetitive word occurrences, the new probabilistic model based on this distribution is able to model the concavity of the score function more effectively. To avoid the empirical tuning of retrieval parameters, we design several parameter estimation algorithms to automatically set model parameters. Additionally, we propose a pseudo-relevance feedback algorithm based on the mixture modeling of the Dirichlet compound multinomial distribution to further improve retrieval accuracy. Finally, our experiments show that both the baseline probabilistic retrieval algorithm based on the DCM distribution and the corresponding pseudo-relevance feedback algorithm outperform the existing language modeling systems on several TREC retrieval tasks. The main objective of this research is to develop an effective probabilistic model based on the DCM distribution. A secondary objective is to provide a thorough understanding of the probabilistic retrieval model by a theoretical understanding of various text distribution assumptions.
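
The burstiness idea in the abstract can be illustrated with a minimal sketch. It is not the paper's retrieval model or its parameter estimation. It only evaluates the standard DCM (Pólya) probability of a word-count vector and compares it with a multinomial. The function names, the two-word vocabulary, and the parameter values are assumptions chosen for illustration.

```python
import math


def dcm_log_prob(counts, alpha):
    """Log-probability of a word-count vector under the DCM (Polya) distribution."""
    n = sum(counts)
    alpha_sum = sum(alpha)
    # Multinomial coefficient n! / prod(x_w!), computed in log space.
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Normalizer Gamma(A) / Gamma(A + n), where A is the sum of the alphas.
    log_norm = math.lgamma(alpha_sum) - math.lgamma(alpha_sum + n)
    # Per-word factors Gamma(x_w + alpha_w) / Gamma(alpha_w).
    log_words = sum(
        math.lgamma(x + a) - math.lgamma(a) for x, a in zip(counts, alpha)
    )
    return log_coef + log_norm + log_words


def multinomial_log_prob(counts, probs):
    """Log-probability of the same count vector under a plain multinomial."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Words with zero count contribute nothing to the sum.
    return log_coef + sum(
        x * math.log(p) for x, p in zip(counts, probs) if x > 0
    )


def polya_next_word_prob(seen_count, total_seen, alpha_w, alpha_sum):
    """Polya urn predictive probability of drawing a word again.

    It grows with the number of times the word was already seen,
    which is the dependency between repeated occurrences.
    """
    return (alpha_w + seen_count) / (alpha_sum + total_seen)


if __name__ == "__main__":
    # Hypothetical toy setup: two words, both equally likely a priori.
    alpha = [0.5, 0.5]
    probs = [0.5, 0.5]
    bursty_doc = [2, 0]  # the same word occurs twice
    print("DCM:        ", math.exp(dcm_log_prob(bursty_doc, alpha)))
    print("Multinomial:", math.exp(multinomial_log_prob(bursty_doc, probs)))
```

With these toy values the DCM assigns probability 0.375 to the document that repeats one word, against 0.25 under the multinomial. Small alpha values make repeats more likely. As the alphas grow with their proportions held fixed, the DCM approaches the multinomial.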

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Processing & Management - Volume 46, Issue 2, March 2010, Pages 143–158