Article code: 382162
Journal code: 660739
Publication year: 2016
English article: 17 pages, PDF
Full-text version: free download
English title of the ISI article
Predicate enrichment of aligned XPaths for wrapper induction
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
English abstract


• Proposes XPath predicate enrichments for a wrapper induction approach.
• Builds on a generalisation strategy that aligns and merges XPaths.
• Focuses on taking full advantage of XPath syntax for wrapper construction.
• Test data (a PostgreSQL database) is supplied, based on the work of Hao et al. (2011).
• Method can be used to merge data from various heterogeneous sources.

Extracting data from various semi-structured sources is a topic that has received a lot of attention. Wrapper induction specifically has been studied extensively, where users annotate a couple of data sources with examples of the data they want, after which a procedure (wrapper) is constructed that can optimally extract similar data as well. In this paper a novel wrapper induction approach is proposed, exploiting the premise of the general applicability of the XPath query language, studied specifically within the context of web pages. After a user annotates a limited set of web pages with the required data, a generalised XPath is constructed that is capable of extracting the examples and, optimally, similar data as well. This generalised baseline XPath is then enriched with predicates, based on context and structure of the data sources, to optimise the precision/recall balance of the data extraction capability of the wrapper. Six variations of such limiting predicates are introduced and investigated. In this paper, it is shown that the baseline approach often generalises the samples too much, leading to a decreased precision. Enriching the baseline wrapper by the addition of predicates limits the generalisation power of the queries in an intelligent manner. Experimental results show that there is a significant improvement in the overall precision of the generalised query, without an excessive loss in recall. Documented tests and real world experience with a large amount of data show that the technique is flexible, easily understood and applicable in a broad range of applications. It is not only of interest in the fields of web information retrieval, but can also be used in the contexts of, e.g., reverse engineering of databases, ontology expansion and deep web data mining, as both simple lists of data and complex structures can be extracted.
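The idea in the abstract (generalise annotated XPaths, then restrict the result with a predicate) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the alignment procedure or the six predicate variations, so the helper names (`generalise`, `enrich`, `select`, `precision_recall`), the toy page, and the choice of a `class` attribute predicate are all assumptions made for illustration. The sketch only aligns XPaths of equal depth and relies on the limited XPath subset in Python's standard `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# One step of a simple XPath: a tag name with an optional positional index.
STEP = re.compile(r"^(?P<tag>[\w*]+)(?:\[(?P<pos>\d+)\])?$")

def split_steps(xpath):
    """Split e.g. /html/body/ul/li[2]/span[3] into (tag, position) pairs."""
    steps = []
    for raw in xpath.strip("/").split("/"):
        m = STEP.match(raw)
        if m is None:
            raise ValueError(f"unsupported step: {raw!r}")
        pos = m.group("pos")
        steps.append((m.group("tag"), int(pos) if pos else None))
    return steps

def generalise(xpaths):
    """Hypothetical baseline: keep what all samples agree on, drop the rest."""
    parsed = [split_steps(x) for x in xpaths]
    if len({len(p) for p in parsed}) != 1:
        raise ValueError("this sketch only aligns XPaths of equal depth")
    merged = []
    for column in zip(*parsed):
        tags = {tag for tag, _ in column}
        positions = {pos for _, pos in column}
        tag = tags.pop() if len(tags) == 1 else "*"
        pos = positions.pop() if len(positions) == 1 else None
        merged.append(tag if pos is None else f"{tag}[{pos}]")
    return "/" + "/".join(merged)

def enrich(xpath, attribute, value):
    """Append an attribute predicate to the last step (one assumed variant)."""
    return f"{xpath}[@{attribute}='{value}']"

def select(root, xpath):
    """Evaluate an absolute XPath against root using ElementTree's subset."""
    first, _, rest = xpath.strip("/").partition("/")
    if first not in (root.tag, "*") or not rest:
        return []
    return [el.text for el in root.findall("./" + rest)]

def precision_recall(extracted, wanted):
    """Set-based precision and recall of extracted values against a target."""
    hits = len(set(extracted) & set(wanted))
    precision = hits / len(set(extracted)) if extracted else 0.0
    recall = hits / len(set(wanted)) if wanted else 0.0
    return precision, recall

# Toy page (an assumption): the second item has an extra <span>.
DOC = """<html><body><ul>
<li><span class="name">Pen</span><span class="price">2</span></li>
<li><span class="name">Ink</span><span class="tag">new</span><span class="price">5</span></li>
<li><span class="name">Pad</span><span class="price">3</span></li>
</ul></body></html>"""
root = ET.fromstring(DOC)

# Two user annotations pointing at the prices of the first two items.
samples = ["/html/body/ul/li[1]/span[2]", "/html/body/ul/li[2]/span[3]"]
wanted = ["2", "5", "3"]

baseline = generalise(samples)              # /html/body/ul/li/span
enriched = enrich(baseline, "class", "price")

print(baseline, precision_recall(select(root, baseline), wanted))
print(enriched, precision_recall(select(root, enriched), wanted))
```

On this toy page the baseline query matches every `span`, so recall is complete but precision drops; the added predicate restores precision without losing any of the wanted values, which mirrors the trade-off the abstract describes.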

Publisher
Database: Elsevier - ScienceDirect
Journal: Expert Systems with Applications - Volume 51, 1 June 2016, Pages 259–275