RCrawler: An R package for parallel web crawling and scraping

Article code	Journal code	Publication year	English article	Full-text version
4978383	1452369	2017	9-page PDF	Free download

Keywords

R package; Web crawler; Web mining; Data collection

Related topics

Engineering and Basic Sciences, Computer Engineering, Software

Abstract

RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, and store pages, extract contents, and produce data that can be directly employed for web content mining applications. It is also flexible, and could be adapted to other applications. The main features of RCrawler are multi-threaded crawling, content extraction, and duplicate content detection. In addition, it includes functionalities such as URL and content-type filtering, depth-level control, and a robots.txt parser. Our crawler has a highly optimized system, and can download a large number of pages per second while being robust against certain crashes and spider traps. In this paper, we describe the design and functionality of RCrawler, and report on our experience of implementing it in an R environment, including different optimizations that handle the limitations of R. Finally, we discuss our experimental results.
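
To make the features named in the abstract more concrete, here is a minimal, language-agnostic sketch written in Python. It is not RCrawler's implementation or API. It illustrates only three of the listed ideas (domain-based URL filtering, depth-level control, and duplicate content detection) on a hypothetical in-memory site, so that it runs without network access. The `SITE` data, the `crawl` function, and its parameters are assumptions made for illustration. Exact hashing stands in for whatever duplicate detection the package actually uses, and multi-threading and robots.txt handling are left out.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
import hashlib

# Hypothetical in-memory "site": URL -> (page text, outgoing hrefs).
SITE = {
    "http://example.org/": ("home", ["/a", "/b", "http://other.org/x"]),
    "http://example.org/a": ("page a", ["/c", "/"]),
    "http://example.org/b": ("page a", ["/c"]),  # same content as /a
    "http://example.org/c": ("page c", ["/d"]),
    "http://example.org/d": ("page d", []),
}


def crawl(seed, fetch, max_depth):
    """Breadth-first crawl restricted to the seed's domain.

    fetch(url) returns (text, hrefs), or None if the page is unavailable.
    Returns the URLs whose content was stored, in visiting order.
    """
    domain = urlparse(seed).netloc
    seen_urls = {seed}      # URLs already queued, to avoid revisits
    seen_hashes = set()     # content digests, to skip exact duplicates
    stored = []
    queue = deque([(seed, 0)])

    while queue:
        url, depth = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, hrefs = page

        # Duplicate content detection (exact match only in this sketch).
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            stored.append(url)

        # Depth-level control: do not follow links beyond max_depth.
        if depth >= max_depth:
            continue

        for href in hrefs:
            target = urljoin(url, href)
            # Domain-based filtering: stay on the seed's host.
            if urlparse(target).netloc != domain:
                continue
            if target not in seen_urls:
                seen_urls.add(target)
                queue.append((target, depth + 1))

    return stored


if __name__ == "__main__":
    for url in crawl("http://example.org/", SITE.get, max_depth=2):
        print(url)
```

With a depth limit of 2, the sketch stores the home page, `/a`, and `/c`. It skips `/b` because its content duplicates `/a`, ignores the link to `other.org`, and never reaches `/d`, which sits at depth 3.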

Publisher

Database: Elsevier - ScienceDirect
Journal: SoftwareX - Volume 6, 2017, Pages 98-106

Authors

Salim Khalil, Mohamed Fakir
