Article code: 402731
Journal code: 676993
Publication year: 2013
English article: 12 pages, PDF
Full-text version: free download
English title of the ISI article
Focused crawling enhanced by CBP–SLC
Related topics
Engineering and Basic Sciences > Computer Engineering > Artificial Intelligence
English abstract


• A heuristic-based approach, CBP–SLC, is presented for enhancing focused crawling.
• A weighted voting classifier using the TFIPNDF feature-weighting approach is built.
• 1-DNFC identifies more reliable negative documents from the unlabeled example set.
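
For orientation, the basic 1-DNF heuristic that 1-DNFC improves on can be sketched as follows. This is a minimal illustration of plain 1-DNF as commonly described in the positive-unlabeled learning literature, not the paper's 1-DNFC variant, whose details are not given here. The function name, the token-list document representation, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def one_dnf_reliable_negatives(positive_docs, unlabeled_docs):
    """Basic 1-DNF: return (positive_features, reliable_negatives).

    Each document is a list of tokens (an assumed representation).
    """
    # Document frequency per term: a term counts once per document.
    pos_df = Counter(term for doc in positive_docs for term in set(doc))
    unl_df = Counter(term for doc in unlabeled_docs for term in set(doc))
    n_pos, n_unl = len(positive_docs), len(unlabeled_docs)

    # A term is a "positive feature" if it is relatively more frequent
    # in the positive set than in the unlabeled set.
    positive_features = {
        term
        for term, count in pos_df.items()
        if count / n_pos > unl_df.get(term, 0) / n_unl
    }

    # Reliable negatives: unlabeled documents with no positive feature.
    reliable_negatives = [
        doc for doc in unlabeled_docs if positive_features.isdisjoint(doc)
    ]
    return positive_features, reliable_negatives


# Hypothetical toy data, not taken from the paper.
positive = [["crawler", "web"], ["crawler", "focused"]]
unlabeled = [["crawler", "web"], ["cooking", "recipe"], ["web", "recipe"]]
features, negatives = one_dnf_reliable_negatives(positive, unlabeled)
```

In this toy run, "web" is not selected as a positive feature because it is relatively more common in the unlabeled set, so an unlabeled document containing only "web" and "recipe" is still treated as a reliable negative.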

The complexity of Web information environments and the prevalence of multiple-topic Web pages significantly degrade the performance of focused crawling. Within a Web page, anchors or some link contexts may misguide the crawler, and a highly relevant region may be obscured by the low overall relevance of the page. Partitioning Web pages into smaller blocks can therefore significantly improve performance. In view of this, the paper presents a heuristic-based approach, CBP–SLC (Content Block Partition–Selective Link Context), which combines a Web page partition algorithm with selective use of link context according to the relevance of content blocks, to enhance focused Web crawling. To guide the crawler, we build a weighted voting classifier by iteratively applying the SVM algorithm based on TFIPNDF, a novel feature-weighting approach that improves on TFIDF. For classification, an improved 1-DNF algorithm, called 1-DNFC, is also proposed to identify more reliable negative documents from the unlabeled example set. Experimental results show that the classifier using TFIPNDF outperforms the one using TFIDF, and that our crawler outperforms Breadth-First, Best-First, Anchor Text Only, Link-context, SLC and CBP in both harvest rate and target recall, which indicates that the new techniques are efficient and feasible.
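
The two evaluation measures named in the abstract can be sketched under their usual definitions in the focused-crawling literature: harvest rate as the fraction of crawled pages that are relevant, and target recall as the fraction of a known target set that the crawler reached. The abstract does not spell out the exact formulas used in the paper, so the definitions, function names, and URL-list inputs below are assumptions.

```python
def harvest_rate(crawled_urls, relevant_urls):
    """Fraction of crawled pages judged relevant (assumed definition)."""
    if not crawled_urls:
        return 0.0
    relevant = set(relevant_urls)
    hits = sum(1 for url in crawled_urls if url in relevant)
    return hits / len(crawled_urls)


def target_recall(crawled_urls, target_urls):
    """Fraction of a known target set that was crawled (assumed definition)."""
    targets = set(target_urls)
    if not targets:
        return 0.0
    return len(targets & set(crawled_urls)) / len(targets)


# Hypothetical example: four pages crawled, two of them relevant,
# and one of two known target pages reached.
crawled = ["a", "b", "c", "d"]
hr = harvest_rate(crawled, {"a", "c", "x"})
tr = target_recall(crawled, {"a", "x"})
```

Harvest rate rewards a crawler for staying on topic, while target recall rewards it for coverage, which is why the two are reported together.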

Publisher
Database: Elsevier - ScienceDirect
Journal: Knowledge-Based Systems - Volume 51, October 2013, Pages 15–26