Article ID: 6923573
Journal: Computers in Industry
Published Year: 2018
Pages: 11
File Type: PDF
Abstract
The increased availability of unstructured text documents in industry, such as e-mails, office documents, and PDF files, has drawn many researchers to information extraction. The objective of this work is to extract information from unstructured tender documents in the power plant industry. The extraction efficiency of recent works depends on linguistic structure and keyword taxonomy. Hence, these approaches are unsuitable for domain-specific applications that demand semantic and contextual taxonomy together. In this paper, a two-phase classification approach for information extraction with feature weighting is proposed. The approach performs sentence classification in the first phase, followed by word classification in the second. As industries span multiple domains, a multi-domain layered industrial ontology is used for knowledge representation. The unstructured documents are enhanced into DAG-based semi-structured text with enriched features. A unique feature transformation approach, based on the categorical data type of the features, is attempted to handle heterogeneous textual features. The approach is evaluated on real documents obtained from power plant tenders. The results show a minimal loss of precision, which can be rectified by enriching the training data and customizing standard parser algorithms to suit the domain requirements.
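The two-phase idea in the abstract (classify sentences first, then classify words only within the retained sentences) can be illustrated with a small sketch. This is not the authors' implementation: the keyword weights, the threshold, the unit and equipment vocabularies, and all function names below are hypothetical stand-ins for the ontology-driven features and trained classifiers that the paper describes.

```python
import re

# Hypothetical feature weights for phase 1 (assumed values, not from the paper).
SENTENCE_WEIGHTS = {"capacity": 2.0, "rated": 1.5, "boiler": 1.0, "turbine": 1.0}
SENTENCE_THRESHOLD = 2.0  # assumed cut-off for a "relevant" sentence

# Hypothetical vocabularies for phase 2 (stand-ins for a domain ontology).
UNITS = {"mw", "kw", "kv", "bar"}
EQUIPMENT = {"boiler", "turbine", "generator"}


def split_sentences(text):
    """Split text on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Return alphabetic words and (possibly decimal) numbers, in order."""
    return re.findall(r"[A-Za-z]+|\d+(?:\.\d+)?", sentence)


def is_relevant(sentence):
    """Phase 1: weighted keyword score compared against a threshold."""
    score = sum(SENTENCE_WEIGHTS.get(tok.lower(), 0.0) for tok in tokenize(sentence))
    return score >= SENTENCE_THRESHOLD


def classify_word(token):
    """Phase 2: assign a coarse label to a single token."""
    if token[0].isdigit():
        return "VALUE"
    lowered = token.lower()
    if lowered in UNITS:
        return "UNIT"
    if lowered in EQUIPMENT:
        return "EQUIPMENT"
    return "O"  # outside any category of interest


def extract(text):
    """Run phase 1 on every sentence, then phase 2 on the retained ones."""
    results = []
    for sentence in split_sentences(text):
        if not is_relevant(sentence):
            continue
        for token in tokenize(sentence):
            label = classify_word(token)
            if label != "O":
                results.append((token, label))
    return results


if __name__ == "__main__":
    sample = "The boiler has a rated capacity of 500 MW. Bids must be submitted by post."
    print(extract(sample))
```

In this sketch the second sentence of the sample is discarded in phase 1, so phase 2 never inspects its words. Restricting word classification to sentences that have already been judged relevant is what makes the approach two-phase.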
Related Topics
Physical Sciences and Engineering > Computer Science > Computer Science Applications