Information extraction for search engines using fast heuristic techniques

Article code	Journal code	Publication year	English article	Full-text version
378989	659248	2010	28-page PDF	Free download

Keywords

IE, Information extraction, Search engine

Related topics

Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence

Abstract

We study the structured records of web pages and the problems associated with extracting and aligning these records. Current automatic wrappers are complicated because they must locate the relevant data region using visual cues and rely on complicated algorithms to check the similarity of data records. In this paper, we develop a non-visual automatic wrapper which questions the need for complex visual-based wrappers in data extraction. The novel techniques for our wrapper are (1) filtering rules to detect and filter out irrelevant data records, (2) a tree matching algorithm using frequency measures to increase the speed of data extraction, (3) an algorithm to calculate the number and size of the components of data records to detect the correct data region, (4) a data alignment algorithm which is able to align iterative (repetitive HTML command tags) and disjunctive (optional) data items and (5) a data merging and partitioning method to solve the imperfect segmentation problem (the problem of correctly identifying the atomic entities in data items). Results show that our wrapper is as robust as, and in many cases outperforms, state-of-the-art wrappers such as ViNT and DEPTA. This wrapper could have significant speed advantages when processing large volumes of website data, which could be helpful in meta search engine development.
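
Technique (2) in the abstract refers to tree matching that uses frequency measures. The abstract does not describe the algorithm itself, so the following is only a minimal sketch of the general idea, assuming that two candidate data records can be compared cheaply by their HTML tag counts before any full tree comparison. The names `tag_frequencies` and `frequency_similarity` are hypothetical and do not come from the paper.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each opening tag occurs in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_frequencies(fragment):
    """Return a Counter mapping tag name -> number of occurrences."""
    parser = TagCounter()
    parser.feed(fragment)
    parser.close()
    return parser.counts


def frequency_similarity(a, b):
    """Assumed measure: shared tag occurrences divided by the
    combined (per-tag maximum) occurrences, giving a value in [0, 1]."""
    fa, fb = tag_frequencies(a), tag_frequencies(b)
    total = sum((fa | fb).values())   # Counter union keeps the max count per tag
    if total == 0:
        return 1.0                    # two tag-free fragments count as identical
    shared = sum((fa & fb).values())  # Counter intersection keeps the min count
    return shared / total


record_1 = "<li><a href='#'>Title A</a><span>10.00</span></li>"
record_2 = "<li><a href='#'>Title B</a><span>12.50</span></li>"
advert = "<div><p>Sponsored link</p></div>"

print(frequency_similarity(record_1, record_2))
print(frequency_similarity(record_1, advert))
```

Under this assumed measure, two records built from the same template score high, while an unrelated block such as an advertisement scores low. Counting tags takes time linear in the fragment size, which is the kind of shortcut that could avoid a more expensive node-by-node tree comparison.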

Publisher

Database: Elsevier - ScienceDirect
Journal: Data & Knowledge Engineering - Volume 69, Issue 2, February 2010, Pages 169–196

Authors

Jer Lang Hong, Eu-Gene Siew, Simon Egerton
