Article ID: 484132
Journal: Procedia Computer Science
Published Year: 2016
Pages: 10 Pages
File Type: PDF
Abstract

The scientific community across many disciplines is exploring new ways to extract knowledge from all available sources. Historically, written manuscripts have been the medium of choice for recording experimental findings. Many disciplines, such as social science and medical science, are exploring ways to automate knowledge discovery from the vast repository of published scientific work. This work attempts to accelerate information extraction by extending Kepler, a graphical workflow management tool. Kepler provides a simple way of designing and executing complex workflows in the form of directed graphs. This work presents a scalable approach for converting published research from PDF documents into indexable XML documents using Kepler. This conversion is a critical step in the Natural Language Processing pipeline. Kepler's distributed data processing capability enables scientists to scale this computation by simply adding more computing resources in the cloud.
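
As a rough illustration of the conversion step described above, the following Python sketch wraps already-extracted page text in a simple XML document and converts several documents concurrently. It does not use Kepler or its API. The function names (`pages_to_xml`, `convert_all`), the `<document>`/`<page>` element layout, and the thread pool standing in for distributed workers are all assumptions made for illustration, and the PDF text extraction itself is assumed to have happened upstream.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def pages_to_xml(doc_id, pages):
    """Wrap already-extracted page text in a simple XML document.

    The <document>/<page> layout is a hypothetical schema, not the
    format produced by the work described in the abstract.
    """
    root = ET.Element("document", {"id": doc_id})
    for number, text in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", {"number": str(number)})
        # Collapse runs of whitespace left over from text extraction.
        page.text = " ".join(text.split())
    return ET.tostring(root, encoding="unicode")


def convert_all(documents, max_workers=4):
    """Convert many documents concurrently; returns {doc_id: xml_string}.

    A local thread pool stands in for distributed workers here (an
    assumption); each document is converted independently of the others.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            doc_id: pool.submit(pages_to_xml, doc_id, pages)
            for doc_id, pages in documents.items()
        }
        return {doc_id: future.result() for doc_id, future in futures.items()}
```

Because each document is converted independently, the work can be split across more workers without coordination between them, which is the property that makes this kind of conversion straightforward to scale out.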
