Industrial information extraction through multi-phase classification using ontology for unstructured documents

Article ID	Journal	Published Year	Pages	File Type
6923573	Computers in Industry	2018	11 Pages	PDF

Abstract

The growing availability of unstructured text documents in industry, such as e-mails, office documents, and PDF files, has drawn many researchers to information extraction. This work aims to extract information from unstructured tender documents of the power plant industry. The extraction efficiency of recent approaches depends on linguistic structure and keyword taxonomy, which makes them unsuitable for domain-specific applications that demand semantic and contextual taxonomy together. In this paper, a two-phase classification approach to information extraction with feature weighting is proposed. It performs sentence classification in the first phase, followed by word classification in the second. As industries span multiple domains, a multi-domain layered industrial ontology is used for knowledge representation. The unstructured documents are converted into semi-structured text based on a directed acyclic graph (DAG) and enriched with features. A feature transformation approach based on the categorical data type of each feature is explored to handle heterogeneous textual features. The proposal is evaluated on real-world documents obtained from power plant tenders. The results show a minimal loss of precision, which can be remedied by enriching the training data and customizing standard parser algorithms to suit the domain requirements.
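The two-phase flow described in the abstract (sentence classification first, then word classification on the retained sentences) can be illustrated with a minimal sketch. This is not the paper's method: the keyword sets `SENTENCE_CUES` and `WORD_LABELS`, the label names, and all function names are hypothetical stand-ins for the trained classifiers and the industrial ontology, chosen only to show how the two phases chain together.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicons (assumptions for illustration only). In the paper
# these roles are played by trained classifiers and a layered ontology.
SENTENCE_CUES: Set[str] = {"capacity", "pressure", "voltage", "rated"}
WORD_LABELS: Dict[str, str] = {
    "mw": "UNIT",
    "bar": "UNIT",
    "kv": "UNIT",
    "boiler": "EQUIPMENT",
    "turbine": "EQUIPMENT",
}


def tokenize(sentence: str) -> List[str]:
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    tokens = []
    for raw in sentence.split():
        tok = raw.strip(".,;:()").lower()
        if tok:
            tokens.append(tok)
    return tokens


def is_relevant(sentence: str) -> bool:
    """Phase 1: keep a sentence only if it contains a cue word."""
    return any(tok in SENTENCE_CUES for tok in tokenize(sentence))


def label_words(sentence: str) -> List[Tuple[str, str]]:
    """Phase 2: label the words of a sentence that passed phase 1."""
    labelled = []
    for tok in tokenize(sentence):
        if tok in WORD_LABELS:
            labelled.append((tok, WORD_LABELS[tok]))
        elif tok.replace(".", "", 1).isdigit():
            labelled.append((tok, "VALUE"))
    return labelled


def extract(document: str) -> List[Tuple[str, List[Tuple[str, str]]]]:
    """Run phase 1 over all sentences, then phase 2 over the survivors."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    return [(s, label_words(s)) for s in sentences if is_relevant(s)]


if __name__ == "__main__":
    doc = "The turbine has a rated capacity of 210 MW. Bids close on Friday."
    for sentence, labels in extract(doc):
        print(sentence, labels)
```

Filtering sentences before labelling words means the second phase only sees text that is likely to carry extractable values, which is the motivation for a phased design in general; how the paper weights features within each phase is not covered by this sketch.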

Keywords

Information extraction; Feature transformation; Ontology