Article ID: 10355211
Journal: Information Processing & Management
Published Year: 2005
Pages: 21
File Type: PDF
Abstract
This paper presents a hybrid approach to identifying protein names in biomedical texts, a crucial step for text mining. The approach uses a set of simple heuristics for the initial detection of protein names and a probabilistic model for locating complete protein names; a protein name dictionary is consulted as a complementary resource. In contrast to previously proposed methods, it avoids natural language processing tools such as part-of-speech taggers and syntactic parsers and relies solely on surface clues, which reduces processing overhead. We also propose a framework for automatically creating a large-scale corpus annotated with protein names, which can then be used to train the probabilistic model. We implemented a protein name identification system, Protex, based on this method and evaluated it against a system developed by other researchers on a common test set. The experiments showed that the automatically constructed corpus is as useful for training as manually annotated corpora and that Protex performs effectively in identifying compound protein names.
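As a rough illustration of what detection from surface clues plus a dictionary lookup might look like, the following is a minimal sketch only. The abstract does not list Protex's actual heuristics, so the rules (letter-digit mixes, internal capitals), the toy dictionary, and the function names `tokenize`, `is_candidate`, and `detect_candidates` are all assumptions made for illustration. The probabilistic step that expands a candidate into a complete compound name is not shown.

```python
import re

# Hypothetical toy dictionary; a real system would load a large protein name list.
PROTEIN_DICTIONARY = {"insulin", "p53"}

PUNCTUATION = ".,;:()"


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation.

    No part-of-speech tagging or parsing is involved, only surface forms.
    """
    tokens = (raw.strip(PUNCTUATION) for raw in text.split())
    return [t for t in tokens if t]


def is_candidate(token):
    """Return True if a token looks like (part of) a protein name.

    These rules are illustrative assumptions, not the rules used by Protex.
    """
    # Dictionary lookup, case-insensitive.
    if token.lower() in PROTEIN_DICTIONARY:
        return True
    # A token that mixes letters and digits, e.g. "IL-2".
    if re.search(r"\d", token) and re.search(r"[A-Za-z]", token):
        return True
    # An uppercase letter after the first character, e.g. "NF-kappaB".
    if re.search(r"[A-Z]", token[1:]):
        return True
    return False


def detect_candidates(text):
    """Return the tokens flagged as protein name candidates, in text order."""
    return [t for t in tokenize(text) if is_candidate(t)]


if __name__ == "__main__":
    sentence = "The binding of IL-2 to insulin receptor was observed."
    print(detect_candidates(sentence))
```

Rules this simple would also flag ordinary abbreviations such as "DNA", which is one reason a later filtering or scoring step would be needed in practice.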
Related Topics
Physical Sciences and Engineering > Computer Science > Computer Science Applications