A focused crawler combinatory link and content model based on T-Graph principles

Article code	Journal code	Publication year	English article	Full-text version
453478	694925	2016	11-page PDF	Free download

Keywords

Focused Web crawler; Information retrieval; Search engine

Related topics

Engineering and Basic Sciences, Computer Engineering, Computer Networks and Communications

English abstract

• We present the architectural design of a focused Web crawler that combines link-based and content-based approaches to predict the topical focus of an unvisited page.
• We present a custom method using the Dewey decimal classification system to best classify the subject of an unvisited page into standard human knowledge categories.
• To prioritize an unvisited URL, we use a dynamic, flexible and updating hierarchical data structure called T-Graph. It helps find the shortest path to get to on-topic pages on the Web.
• For a background review, the experimental results from several crawlers are presented.
• The functional and non-functional requirements of a focused Web crawler are elicited and described, as well as standard evaluation criteria.
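
The idea of prioritizing unvisited URLs by combining a link-based and a content-based signal can be illustrated with a small crawl frontier. This is a minimal sketch and not the paper's T-Graph algorithm: the `Frontier` class, the `link_weight` mixing parameter, and the weighted-sum scoring are all assumptions introduced here for illustration, and the two input scores are assumed to be precomputed values in [0, 1].

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class FrontierEntry:
    # heapq is a min-heap, so the negated score is stored to pop the best URL first.
    neg_score: float
    url: str = field(compare=False)


class Frontier:
    """Priority queue of unvisited URLs (hypothetical helper, not from the paper)."""

    def __init__(self, link_weight: float = 0.5):
        # link_weight is an assumed mixing parameter in [0, 1].
        self.link_weight = link_weight
        self._heap = []
        self._seen = set()

    def combined_score(self, link_score: float, content_score: float) -> float:
        # Assumed combination rule: a simple weighted sum of the two signals.
        w = self.link_weight
        return w * link_score + (1.0 - w) * content_score

    def add(self, url: str, link_score: float, content_score: float) -> None:
        # Skip URLs that were already queued so each one is scored only once.
        if url in self._seen:
            return
        self._seen.add(url)
        score = self.combined_score(link_score, content_score)
        heapq.heappush(self._heap, FrontierEntry(-score, url))

    def pop(self):
        # Return the highest-scoring unvisited URL together with its score.
        entry = heapq.heappop(self._heap)
        return entry.url, -entry.neg_score

    def __len__(self) -> int:
        return len(self._heap)
```

A crawler loop would call `add` for every link extracted from a downloaded page and `pop` to choose the next page to fetch. A weighted sum is only one possible way to merge the two signals; it is used here because it keeps the effect of each signal easy to inspect.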

The two significant tasks of a focused Web crawler are finding relevant documents and prioritizing them for effective download. For the first task, we propose an algorithm to fetch and analyze the most effective HTML elements of the page to predict and elicit the topical focus of each unvisited page with high accuracy. For the second task, we propose a scoring function of the relevant URLs through the use of T-Graph to prioritize each unvisited link. Thus, our novel method uniquely combines these approaches, giving precision and recall values close to 50%, which indicate the significance of the proposed architecture.
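
The precision and recall figures quoted above follow the standard information-retrieval definitions: precision is the fraction of downloaded pages that are on-topic, and recall is the fraction of on-topic pages that were downloaded. The sketch below shows that arithmetic under stated assumptions; the function name `precision_recall` and the idea of representing pages as sets of identifiers are illustrative choices, not details taken from the paper's evaluation setup.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a crawl.

    retrieved: identifiers of the pages the crawler downloaded.
    relevant:  identifiers of the pages judged on-topic.
    Both arguments are hypothetical inputs for this sketch.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    # Pages that were both downloaded and on-topic.
    hits = len(retrieved & relevant)
    # Guard against division by zero when either set is empty.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if a crawler downloads pages `p1` to `p4` and the on-topic pages are `p2`, `p4`, `p5` and `p6`, two of the four downloads are relevant and two of the four relevant pages were found, so both precision and recall equal 0.5.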

Publisher

Database: Elsevier - ScienceDirect
Journal: Computer Standards & Interfaces - Volume 43, January 2016, Pages 1–11

Authors

Ali Seyfi, Ahmed Patel
