Learning a taxonomy from a set of text documents

Article ID	Journal	Published Year	Pages	File Type
496718	Applied Soft Computing	2012	11 Pages	PDF

Abstract

We present a methodology for learning a taxonomy from a set of text documents, each of which describes one concept. The taxonomy is obtained by clustering the concept definition documents with a hierarchical approach to the Self-Organizing Map. We compare three feature extraction approaches with varying degrees of language independence: fuzzy logic-based feature weighting and selection, statistical keyphrase extraction, and the traditional tf-idf weighting scheme. The experiments are conducted for English, Finnish, and Spanish. The results show that while the rule-based fuzzy logic systems have an advantage in automatic taxonomy learning, taxonomies can also be constructed with tolerable results using statistical methods that require no domain- or style-specific knowledge.

► We learn a taxonomy from a set of encyclopedia articles.
► A hierarchical approach to the Self-Organizing Map is used to cluster the documents.
► Experiments are conducted for English, Finnish, and Spanish.
► Rule-based systems have an advantage in automatic taxonomy learning.
► Taxonomies can also be constructed with good results using statistical methods.
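
Of the three feature extraction schemes named in the abstract, tf-idf is the simplest to illustrate. The sketch below is a common textbook form (term frequency times log inverse document frequency), not necessarily the exact variant used in the paper. The whitespace tokenizer, the example sentences, and the function names `tfidf` and `cosine` are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Assumption: naive lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical concept definitions, one concept per document.
docs = [
    "a dog is a domesticated mammal",
    "a cat is a small domesticated mammal",
    "a sparrow is a small bird",
]
vectors = tfidf(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Terms that occur in every document (here "a" and "is") receive zero weight, so similarity is driven by the more distinctive terms. Vectors of this kind are what a clustering method such as the Self-Organizing Map would take as input.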

Keywords

Keyphrase extraction; Knowledge representation; Document clustering; Fuzzy logic; Self-organizing map; Multilinguality