Article code | Journal code | Publication year | English article | Full-text version
494822 | 862808 | 2015 | 16-page PDF | Free download
English title of the ISI article
An improved focused crawler based on Semantic Similarity Vector Space Model
Persian translation of the title
یک خزنده متمرکز بهبود یافته براساس مدل فضایی مشابهت معنایی
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Science Software
English abstract


• An improved retrieval model, the Semantic Similarity Vector Space Model (SSVSM), is proposed.
• The proposed model accurately predicts the priorities of unvisited URLs with respect to the given topic.
• The proposed model guides focused crawlers to download a large quantity of high-quality web pages.

A focused crawler is topic-specific and aims to selectively collect web pages from the Internet that are relevant to a given topic. In many studies, the Vector Space Model (VSM) and the Semantic Similarity Retrieval Model (SSRM) take advantage of cosine similarity and semantic similarity to compute similarities between web pages and the given topic. However, if there are no common terms between a web page and the given topic, the VSM will not obtain the proper topical similarity of the web page. In addition, if all of the terms between them are synonyms, then the SSRM will also not obtain the proper topical similarity. To address these problems, this paper proposes an improved retrieval model, the Semantic Similarity Vector Space Model (SSVSM). The model integrates the TF*IDF values of the terms and the semantic similarities among the terms to construct topic and document semantic vectors that are mapped to the same double-term set. It then computes the cosine similarities between these semantic vectors as the topic-relevant similarities of documents, including the full texts and anchor texts of unvisited hyperlinks. Next, the proposed model predicts the priorities of the unvisited hyperlinks by integrating the full-text and anchor-text topic-relevant similarities. The experimental results demonstrate that this approach improves the performance of focused crawlers and outperforms other focused crawlers based on Breadth-First, VSM and SSRM. In conclusion, this method is significant and effective for focused crawlers.
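
The VSM limitation described in the abstract, and the general idea of mapping topic and document weights onto a shared set of term pairs, can be illustrated with a small Python sketch. This is not the paper's formulation: the TF*IDF weights, the `SEM` similarity table, the pair-based construction in `semantic_vectors`, and the linear weighting `alpha` in `link_priority` are all hypothetical choices made only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF*IDF weights; the topic and the document share no terms.
topic = {"car": 0.8, "engine": 0.6}
doc = {"automobile": 0.7, "motor": 0.5}

# Hypothetical term-to-term semantic similarities in [0, 1].
SEM = {("car", "automobile"): 1.0, ("engine", "motor"): 0.9}

def sem_sim(a, b):
    """Look up a symmetric semantic similarity; identical terms score 1."""
    if a == b:
        return 1.0
    return SEM.get((a, b), SEM.get((b, a), 0.0))

def semantic_vectors(topic_w, doc_w):
    """Map both weight dicts onto one shared set of (topic term, doc term) pairs.

    Assumption: each component is the term's TF*IDF weight scaled by the
    semantic similarity of the pair; pairs with zero similarity are dropped.
    """
    topic_vec, doc_vec = {}, {}
    for t, wt in topic_w.items():
        for d, wd in doc_w.items():
            s = sem_sim(t, d)
            if s > 0.0:
                topic_vec[(t, d)] = wt * s
                doc_vec[(t, d)] = wd * s
    return topic_vec, doc_vec

def link_priority(full_text_sim, anchor_sim, alpha=0.5):
    """Assumed linear blend of full-text and anchor-text similarities."""
    return alpha * full_text_sim + (1.0 - alpha) * anchor_sim

plain = cosine(topic, doc)              # 0.0: no common terms
t_vec, d_vec = semantic_vectors(topic, doc)
semantic = cosine(t_vec, d_vec)         # positive: synonyms now contribute
print(plain, semantic)
```

With disjoint vocabularies the plain cosine is zero, while the pair-based vectors yield a positive score because synonymous terms land on the same component. In a real crawler the similarity table would come from a lexical resource rather than a hand-written dictionary.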


Publisher
Database: Elsevier - ScienceDirect
Journal: Applied Soft Computing - Volume 36, November 2015, Pages 392–407