Article ID: 493372
Journal: Procedia Technology
Published Year: 2012
Pages: 5
File Type: PDF
Abstract

A huge amount of information is available in unstructured (text) documents, and knowledge discovery in such documents has been recognized as a promising task in recent years. Because unstructured documents are typically formatted for human viewing, their layout varies widely from document to document. Frequent changes to their formatting make it even harder to construct a global schema, so discovering interesting rules from them is a complex and tedious process. Most existing systems use hand-coded wrappers to extract information, which is monotonous and time-consuming. In this paper we propose a novel hybrid approach for learning (context-free) grammar rules based on alignment between texts. The approach automatically discovers the grammar rules by grammatical inference over the repeated patterns present in unstructured (text) documents. The generated rules can be used to infer attribute-value pairs from unstructured text documents.

Related Topics
Physical Sciences and Engineering > Computer Science > Computer Science (General)