Article ID Journal Published Year Pages File Type
518267 Journal of Biomedical Informatics 2011 16 Pages PDF
Abstract

Medical Subject Headings (MeSH) are used to index the majority of databases generated by the National Library of Medicine. Essentially, MeSH terms are designed to make information, such as scientific articles, more retrievable and accessible to users of systems such as PubMed. This paper proposes a novel method for automating the assignment of MeSH terms to biomedical publications that takes advantage of citation references to these publications. Our findings show that analysing the citation references that point to a document can provide a useful source of terms that are not present in the document. The use of these citation contexts, as they are known, can thus help to provide a richer document feature representation, which in turn can help improve text mining and information retrieval applications, in our case MeSH term classification. In this paper, we also explore new methods of selecting and utilising citation contexts. In particular, we assess the effect of weighting the importance of citation terms (found in the citation contexts) according to two aspects: (i) the section of the paper they appear in and (ii) their distance to the citation marker.

We conduct intrinsic and extrinsic evaluations of citation term quality. For the intrinsic evaluation, we rely on the UMLS Metathesaurus conceptual database to explore the semantic characteristics of the mined citation terms. We also analyse the “informativeness” of these terms using a class-entropy measure. For the extrinsic evaluation, we run a series of automatic document classification experiments over MeSH terms. Our experimental evaluation shows that citation contexts contain terms that are related to the original document, and that the integration of this knowledge results in better classification performance compared to two state-of-the-art MeSH classification systems: MeSHUP and MTI.

Our experiments also demonstrate that the consideration of Section and Distance factors can lead to statistically significant improvements in citation feature quality, thus opening the way for better document feature representation in other biomedical text processing applications.
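
The Section and Distance weighting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the section weights, the linear distance decay, the window size, and all function names (`distance_weight`, `weight_citation_terms`, `expand_document`) are assumptions chosen only to show how citation terms might be weighted and then merged into a document's feature vector.

```python
# Hypothetical section weights: illustrative values only, not taken from the paper.
SECTION_WEIGHTS = {
    "introduction": 0.5,
    "methods": 1.0,
    "results": 1.0,
    "discussion": 0.75,
}
DEFAULT_SECTION_WEIGHT = 0.5  # assumed fallback for unlisted sections


def distance_weight(distance, window):
    """Assumed linear decay: 1.0 at the citation marker, 0.0 at `window` tokens away."""
    return max(0.0, 1.0 - distance / window)


def weight_citation_terms(contexts, window=4):
    """Score terms found around citation markers.

    `contexts` is a list of (section, tokens, marker_index) tuples, where
    `marker_index` is the position of the citation marker within `tokens`.
    """
    weights = {}
    for section, tokens, marker_index in contexts:
        section_w = SECTION_WEIGHTS.get(section.lower(), DEFAULT_SECTION_WEIGHT)
        for i, token in enumerate(tokens):
            if i == marker_index:
                continue  # skip the citation marker itself
            w = section_w * distance_weight(abs(i - marker_index), window)
            if w > 0:
                term = token.lower()
                weights[term] = weights.get(term, 0.0) + w
    return weights


def expand_document(doc_tokens, citation_weights):
    """Merge raw term counts from the document with weighted citation terms."""
    features = {}
    for token in doc_tokens:
        term = token.lower()
        features[term] = features.get(term, 0.0) + 1.0
    for term, w in citation_weights.items():
        features[term] = features.get(term, 0.0) + w
    return features


if __name__ == "__main__":
    contexts = [("methods", ["kinase", "inhibits", "[1]", "tumour", "growth"], 2)]
    citation_weights = weight_citation_terms(contexts)
    print(expand_document(["kinase", "assay"], citation_weights))
```

In this sketch, terms closer to the marker and terms from sections with a higher assumed weight contribute more to the expanded representation, which could then be passed to any supervised MeSH classifier.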

Graphical abstract

In this graph we illustrate how we combine the original document representation and the citations pointing to it. We develop different supervised document classifiers by combining different sources of expansion terms.

Highlights

► Citation contexts are a useful source of semantically related terms, and they strengthen the topical focus of documents.
► We built document models that significantly improve performance over state-of-the-art systems in a MeSH categorisation task.
► Our best expansion strategy involved using citation terms which also occur as semantically-related words in a domain specific dictionary.
► Attending to the position of the citations in the paper-outline helped to build better classifiers.
