Article code | Journal code | Year | English article | Full-text version |
---|---|---|---|---|
534034 | 870207 | 2015 | 8-page PDF | Free download |
• A key issue when mining web information is the labeling problem.
• Tolerance rough sets are used to structure categorical instances and relations.
• We report a semi-supervised algorithm (TPL) that labels the ontological information.
• The Never Ending Language Learner (Nell) system provides the ontology.
• The performance of TPL compares well with CBS and CPL and handles concept drift.
A key issue when mining web information is the labeling problem: data are abundant on the web but are unlabeled. In this paper, we address this problem by proposing (i) a granular model that uses a tolerance form of rough sets to structure categorical noun phrase instances, as well as semantically related noun phrase pairs, from a given corpus representing unstructured web pages, and (ii) a semi-supervised Tolerant Pattern Learning (TPL) algorithm that labels categorical instances as well as relations. This work is an extension of the TPL algorithm presented in our earlier paper. Our model describes each noun phrase as the set of its co-occurring contextual patterns. We use the ontological information from the Never Ending Language Learner (Nell) system. We compared the performance of our algorithm with the Coupled Bayesian Sets (CBS) and Coupled Pattern Learner (CPL) algorithms for categorical and relational extraction, respectively. Experimental results suggest that TPL can achieve performance comparable to CBS and CPL in terms of precision.
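To make the granular model more concrete, the following is a minimal sketch of a generic tolerance rough set over noun phrases described by their contextual patterns. It is not the TPL algorithm itself. The Jaccard overlap measure, the threshold `EPS`, the toy `CONTEXTS` data, and all function names are assumptions introduced purely for illustration.

```python
from typing import Dict, Set

# Hypothetical toy corpus: each noun phrase maps to the set of
# contextual patterns it co-occurs with (illustrative data only).
CONTEXTS: Dict[str, Set[str]] = {
    "Toronto": {"mayor of _", "lives in _", "city of _"},
    "Ottawa": {"mayor of _", "city of _", "capital _"},
    "Winnipeg": {"city of _", "lives in _"},
    "hockey": {"plays _", "fan of _"},
}

EPS = 0.5  # assumed tolerance threshold


def overlap(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two pattern sets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def tolerance_class(phrase: str, contexts: Dict[str, Set[str]], eps: float) -> Set[str]:
    """All phrases whose pattern sets overlap with `phrase` by at least eps.

    The relation is reflexive and symmetric but not necessarily transitive,
    which is what distinguishes a tolerance relation from an equivalence.
    """
    own = contexts[phrase]
    return {other for other, pats in contexts.items() if overlap(own, pats) >= eps}


def lower_approximation(seeds: Set[str], contexts: Dict[str, Set[str]], eps: float) -> Set[str]:
    """Phrases whose whole tolerance class lies inside the seed set."""
    return {p for p in contexts if tolerance_class(p, contexts, eps) <= seeds}


def upper_approximation(seeds: Set[str], contexts: Dict[str, Set[str]], eps: float) -> Set[str]:
    """Phrases whose tolerance class shares at least one member with the seed set."""
    return {p for p in contexts if tolerance_class(p, contexts, eps) & seeds}


# A hypothetical "city" category seeded with two labeled instances.
SEEDS = {"Toronto", "Ottawa"}
LOWER = lower_approximation(SEEDS, CONTEXTS, EPS)
UPPER = upper_approximation(SEEDS, CONTEXTS, EPS)

if __name__ == "__main__":
    print("lower:", sorted(LOWER))
    print("upper:", sorted(UPPER))
```

In this toy setting the upper approximation pulls in an unlabeled phrase that shares enough patterns with a seed, which is the intuition behind using tolerance classes to propose candidate instances in a semi-supervised setting. The non-transitivity is visible in the data: "Ottawa" is tolerant with "Toronto" and "Toronto" with "Winnipeg", but "Ottawa" and "Winnipeg" are not tolerant with each other.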
Journal: Pattern Recognition Letters - Volume 67, Part 2, 1 December 2015, Pages 130–137