Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
453478 | 694925 | 2016 | 11-page PDF | Free download |

• We present the architectural design of a focused Web crawler that combines link-based and content-based approaches to predict the topical focus of an unvisited page.
• We present a custom method using the Dewey decimal classification system to best classify the subject of an unvisited page into standard human knowledge categories.
• To prioritize an unvisited URL, we use a dynamic, flexible, and updatable hierarchical data structure called T-Graph, which helps find the shortest path to on-topic pages on the Web.
• For a background review, the experimental results from several crawlers are presented.
• The functional and non-functional requirements of a focused Web crawler are elicited and described, as well as standard evaluation criteria.
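
The prioritization idea in the highlights above can be illustrated with a small crawl frontier. This is a minimal sketch only, not the paper's T-Graph or its scoring function: the `Frontier` class, the `combined_score` helper, and the weight `alpha` are hypothetical names, and the linear blend of a content-based score with a link-based score is an assumption made for illustration.

```python
import heapq
from itertools import count


def combined_score(content_score, link_score, alpha=0.5):
    """Blend a content-based and a link-based relevance estimate.

    Assumption: a simple weighted average; the paper's actual
    scoring function is not given in this abstract.
    """
    return alpha * content_score + (1 - alpha) * link_score


class Frontier:
    """Priority queue of unvisited URLs; the highest score is popped first."""

    def __init__(self):
        self._heap = []        # entries are (-score, tie_breaker, url)
        self._best = {}        # url -> best score currently queued
        self._tie = count()    # insertion counter keeps ordering stable

    def push(self, url, score):
        # Keep only the best known score for each queued URL.
        if url in self._best and self._best[url] >= score:
            return
        self._best[url] = score
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        # Skip stale heap entries that were superseded by a better score.
        while self._heap:
            neg_score, _, url = heapq.heappop(self._heap)
            if self._best.get(url) == -neg_score:
                del self._best[url]
                return url, -neg_score
        raise IndexError("frontier is empty")
```

A crawler loop would call `push` for every link extracted from a fetched page and `pop` to choose the next download; tracking already visited URLs is left to the caller in this sketch.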
The two main tasks of a focused Web crawler are finding relevant documents and prioritizing them for effective download. For the first task, we propose an algorithm that fetches and analyzes the most effective HTML elements of a page to predict the topical focus of each unvisited page with high accuracy. For the second task, we propose a scoring function for relevant URLs that uses T-Graph to prioritize each unvisited link. Our method combines these two approaches and achieves precision and recall values close to 50%, which indicate the significance of the proposed architecture.
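
For readers unfamiliar with the two metrics quoted above, the following sketch shows how precision and recall are commonly computed for a crawl. It assumes the usual set-based definitions (fraction of fetched pages that are relevant, and fraction of relevant pages that were fetched); the abstract does not state the exact evaluation procedure, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(fetched, relevant):
    """Return (precision, recall) for a set of fetched pages.

    fetched:  URLs the crawler downloaded
    relevant: URLs known to be on-topic (ground truth)
    """
    fetched, relevant = set(fetched), set(relevant)
    hits = len(fetched & relevant)  # on-topic pages that were downloaded
    precision = hits / len(fetched) if fetched else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a crawl that downloads four pages of which two are on-topic, when four on-topic pages exist in total, yields a precision and a recall of 0.5 each.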
Journal: Computer Standards & Interfaces - Volume 43, January 2016, Pages 1–11