Article code: 453478
Journal code: 694925
Publication year: 2016
English article: 11-page PDF
Full-text version: Free download
English title of the ISI article
A focused crawler combinatory link and content model based on T-Graph principles
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Networks and Communications
English abstract


• We present the architectural design of a focused Web crawler that combines link-based and content-based approaches to predict the topical focus of an unvisited page.
• We present a custom method using the Dewey decimal classification system to best classify the subject of an unvisited page into standard human knowledge categories.
• To prioritize an unvisited URL, we use a dynamic, flexible and updating hierarchical data structure called T-Graph. It helps find the shortest path to get to on-topic pages on the Web.
• For a background review, the experimental results from several crawlers are presented.
• The functional and non-functional requirements of a focused Web crawler are elicited and described, as well as standard evaluation criteria.

The two significant tasks of a focused Web crawler are finding relevant documents and prioritizing them for effective download. For the first task, we propose an algorithm to fetch and analyze the most effective HTML elements of the page to predict and elicit the topical focus of each unvisited page with high accuracy. For the second task, we propose a scoring function of the relevant URLs through the use of T-Graph to prioritize each unvisited link. Thus, our novel method uniquely combines these approaches, giving precision and recall values close to 50%, which indicate the significance of the proposed architecture.
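
As a rough illustration of the second task, the minimal Python sketch below shows a generic best-first crawl frontier that ranks unvisited URLs by mixing a content signal with a link signal. It is not the T-Graph scoring function described in the paper, whose details are not given in this abstract. The names `content_score`, `combined_priority`, `Frontier`, and the weight `alpha` are assumptions introduced only for illustration.

```python
import heapq


def content_score(anchor_text, topic_terms):
    """Fraction of topic terms that appear in a link's anchor text (hypothetical signal)."""
    if not topic_terms:
        return 0.0
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def combined_priority(anchor_text, parent_relevance, topic_terms, alpha=0.5):
    """Weighted mix of a content signal and a link signal, both assumed to lie in [0, 1].

    parent_relevance stands in for any link-based estimate, for example the
    relevance of the page on which the link was found.
    """
    return alpha * content_score(anchor_text, topic_terms) + (1 - alpha) * parent_relevance


class Frontier:
    """Best-first queue of unvisited URLs; the highest priority is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url, priority):
        # Ignore URLs that were already queued once.
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)
```

In use, a crawler would push every link extracted from a downloaded page with its combined priority and always fetch the result of `pop()` next, so that links judged most likely to be on-topic are downloaded first.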

Publisher
Database: Elsevier - ScienceDirect
Journal: Computer Standards & Interfaces - Volume 43, January 2016, Pages 1–11