Article code: 379130
Journal code: 659267
Publication year: 2008
English article: 19 pages, PDF
Full text: free download
English title of the ISI article
Extracting lists of data records from semi-structured web pages
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
Abstract

Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. While several previous works have addressed the same problem, most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make assumptions about how data records are encoded into web pages that do not always hold on real websites. Finally, we have tested our techniques on a large number of real web sources and found them to be very effective.

Publisher
Database: Elsevier - ScienceDirect
Journal: Data & Knowledge Engineering - Volume 64, Issue 2, February 2008, Pages 491–509