Derivation of “is a” taxonomy from Wikipedia Category Graph

Article ID	Journal	Published Year	Pages	File Type
6854396	Engineering Applications of Artificial Intelligence	2016	22 Pages	PDF

Abstract

Knowledge acquisition remains one of the main obstacles to designing intelligent systems that exhibit human-level performance in complex intelligent tasks. Recent developments in crowdsourcing technologies have opened promising opportunities to overcome this problem by exploiting large amounts of machine-readable knowledge to perform tasks requiring human intelligence. Wikipedia exemplifies this research trend: it is the largest collaborative, multilingual resource of linguistic knowledge, containing both unstructured and semi-structured information. In this paper, we propose an approach for deriving an “is a” taxonomy from the Wikipedia Category Graph (WCG), which is an open collaborative resource. After the WCG is built and filtered from a Wikipedia dump, the process consists mainly in exploiting the “BY” tag and the sharing of plural headers. These methods yield a graph formed by a set of disconnected sub-graphs. We therefore propose a process for linking them to obtain an “is a” taxonomy with a single root, modeled as a directed acyclic graph (DAG). Specific DAG handling algorithms are used in this work, including an algorithm for splitting a DAG into sub-DAGs and another for merging two DAGs. The obtained taxonomy is assessed using semantic similarity measures, which quantify the likeness between two concepts or words. To this end, we use a set of well-known benchmarks to compare the results obtained with the generated taxonomy to those achieved with WordNet, a resource created and maintained by domain experts. The experimental results show good correlations between computed values and human judgments. The derived taxonomy also offers broader coverage than WordNet.
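
The abstract does not detail how the disconnected sub-graphs are linked, so the following Python sketch is only an illustration of the end state it describes: a single-rooted “is a” graph that is verified to be acyclic. It assumes the simplest possible linking strategy, attaching every parentless category to an artificial root, which is not claimed to be the authors' procedure. The function names, the `ROOT` label, and the sample categories are all hypothetical.

```python
from collections import defaultdict


def link_under_single_root(edges, root="ROOT"):
    """Build a child -> parents map from (child, parent) "is a" pairs.

    Assumption for illustration: every node without a parent is the top of
    one disconnected sub-graph and is attached to an artificial root.
    """
    parents = defaultdict(set)
    nodes = set()
    for child, parent in edges:
        parents[child].add(parent)
        nodes.update((child, parent))
    for node in sorted(nodes):
        if not parents[node]:  # no parent yet: top of a sub-graph
            parents[node].add(root)
    parents[root] = set()  # the root itself has no parent
    return dict(parents)


def is_acyclic(parents):
    """Return True if following parent links never leads back to a node
    that is still on the current depth-first path."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {node: WHITE for node in parents}

    def visit(node):
        state[node] = GREY
        for parent in parents.get(node, ()):
            colour = state.get(parent, WHITE)
            if colour == GREY:
                return False  # back edge: a cycle exists
            if colour == WHITE and not visit(parent):
                return False
        state[node] = BLACK
        return True

    return all(state[node] != WHITE or visit(node) for node in list(parents))


# Hypothetical sample: two disconnected sub-graphs of category links.
sample_edges = [
    ("Cities in France", "Cities"),
    ("Capitals", "Cities"),
    ("Rivers of Europe", "Rivers"),
]
taxonomy = link_under_single_root(sample_edges)
```

In this sample, `Cities` and `Rivers` have no parent of their own, so both end up directly under `ROOT`, and `is_acyclic(taxonomy)` confirms that the result is a DAG. A real linking step would need to choose semantically meaningful attachment points rather than a flat artificial root.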

Keywords

Semantic similarity