Automated scraping of structured data records from health discussion forums using semantic analysis

Article ID	Journal	Published Year	Pages	File Type
6898927	Informatics in Medicine Unlocked	2018	17 Pages	PDF

Abstract

The amount of information available on the Internet is growing exponentially, so obtaining appropriate information from such a huge repository is an indispensable yet complicated task. Because the structure of web pages varies across websites, there is no “one size fits all” technique for web data extraction. This creates the need for a technique that is independent of page structure, which this paper addresses by identifying informative content through semantic analysis rather than syntactic structure. Social web forums contain web pages generated from server-side templates, and the information on such websites has a wide variety of applications, such as opinion mining, sentiment analysis, topic modeling, and trend analysis. Among social media forums, health discussion forums play a crucial role: analyzing data extracted from such medical forums finds application in disease detection based on symptoms, determination of adverse drug reactions, suggestion of clinical tests for diseases, and so on. In this paper, a fully automated technique for extracting posts from various medical forum websites is devised, and it performs well on differently structured web pages belonging to diverse forum websites. Since the technique is based on semantic features, it can be applied to other social web forums as well.

Keywords

XPath; Information extraction; Structured data