Article code: 396925
Journal code: 670631
Publication year: 2013
English article: 19 pages, PDF
Full-text version: free download
English title of the ISI article
Learning to crawl deep web
Related topics
Engineering and Basic Sciences > Computer Engineering > Artificial Intelligence
Abstract

The deep web, or hidden web, is the part of the Web (usually residing in structured databases) that remains unavailable to standard Web crawlers. Obtaining deep web content is challenging and has been acknowledged as a significant gap in the coverage of search engines. The paper proposes a novel deep web crawling framework based on reinforcement learning, in which the crawler is regarded as an agent and the deep web database as the environment. The agent perceives its current state and selects an action (a query) to submit to the environment according to its Q-value. Whereas existing methods assume that all deep web databases possess full-text search interfaces and rely solely on the statistics (term frequency, TF, or document frequency, DF) of acquired data records to generate the next query, the reinforcement learning framework not only enables a crawler to learn a promising crawling strategy from its own experience but also allows it to use diverse features of query keywords. Experimental results show that the method outperforms state-of-the-art methods in crawling capability and relaxes the full-text search assumption implied by existing methods.


► We introduce a reinforcement learning framework for deep web surfacing.
► We propose a surfacing algorithm for both full-text and non-full-text databases.
► The crawler learns to differentiate rewarding queries from unpromising ones.
► A Q-value approximation algorithm is developed to enable future reward estimation.
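
The agent/environment loop described in the abstract and highlights above can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the toy `RECORDS` database, the two keyword features (a bias term and the keyword's document frequency among acquired records), the linear Q-value estimate, and all function and parameter names are assumptions made for illustration. The sketch also learns from immediate reward only (the number of new records a query returns), whereas the paper estimates future reward as well.

```python
import random

# Hypothetical toy "deep web database": each record is a set of keywords.
RECORDS = [
    {"python", "crawler", "web"},
    {"python", "database"},
    {"web", "search", "engine"},
    {"database", "query", "index"},
    {"crawler", "query"},
    {"search", "index", "ranking"},
]


def submit_query(keyword):
    """Environment: return the ids of all records that contain the keyword."""
    return {i for i, rec in enumerate(RECORDS) if keyword in rec}


def features(keyword, acquired):
    """Assumed features: a bias term and the keyword's DF in acquired records."""
    if not acquired:
        return [1.0, 0.0]
    df = sum(1 for i in acquired if keyword in RECORDS[i])
    return [1.0, df / len(acquired)]


def q_value(weights, feats):
    """Linear Q-value estimate for one candidate query."""
    return sum(w * f for w, f in zip(weights, feats))


def crawl(seed, max_queries=10, alpha=0.1, epsilon=0.0, rng_seed=0):
    """Epsilon-greedy crawler that learns query weights from observed rewards."""
    rng = random.Random(rng_seed)
    acquired = set()      # state: ids of the records retrieved so far
    issued = set()        # queries that were already submitted
    candidates = {seed}   # keywords the agent may submit as actions
    weights = [0.0, 0.0]
    log = []

    for _ in range(max_queries):
        actions = sorted(candidates - issued)
        if not actions:
            break
        if rng.random() < epsilon:
            action = rng.choice(actions)  # explore
        else:
            # Exploit: pick the query with the highest estimated Q-value.
            action = max(
                actions, key=lambda k: q_value(weights, features(k, acquired))
            )

        feats = features(action, acquired)
        new = submit_query(action) - acquired
        reward = float(len(new))  # immediate reward: newly acquired records

        # Gradient step moving the estimate toward the observed reward.
        error = reward - q_value(weights, feats)
        weights = [w + alpha * error * f for w, f in zip(weights, feats)]

        acquired |= new
        issued.add(action)
        for i in new:
            candidates |= RECORDS[i]  # new records supply new candidate queries
        log.append((action, len(new)))

    return acquired, log


if __name__ == "__main__":
    acquired, log = crawl("python")
    print(len(acquired), log)
```

Each candidate query here is scored from features of its keyword rather than from a per-keyword table, because a keyword is submitted at most once and a table entry would never be reused.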

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Systems - Volume 38, Issue 6, September 2013, Pages 801–819