Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
515466 | 867023 | 2015 | 18-page PDF | Free download |
• We analyse named entity recognition and disambiguation performance on tweets.
• Multiple state-of-the-art systems are included.
• Commercial and academic systems suffer the same range of problems.
• Lack of context is a major problem, demanding new, custom NER & NEL approaches.
• A named entity linking corpus is released with the paper.
Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.
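The staged pipeline described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Tweet` container, the stage functions, and the one-entry `TOY_KB` lookup are hypothetical placeholders, not the systems evaluated in the paper. Each stage uses a deliberately naive heuristic so that the data flow between stages is visible.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Tweet:
    """Hypothetical container passed from stage to stage."""
    text: str
    lang: str = ""
    tokens: list = field(default_factory=list)
    pos_tags: list = field(default_factory=list)
    entities: list = field(default_factory=list)  # recognised surface forms
    links: dict = field(default_factory=dict)     # surface form -> URI or None


# Toy knowledge base (assumption): a single hand-written DBpedia-style entry.
TOY_KB = {"London": "http://dbpedia.org/resource/London"}


def identify_language(tweet):
    tweet.lang = "en"  # placeholder: a real system would classify the text
    return tweet


def tokenise(tweet):
    # Keep @mentions and #hashtags as single tokens; split off punctuation.
    tweet.tokens = re.findall(r"[@#]?\w+|[^\w\s]", tweet.text)
    return tweet


def tag_pos(tweet):
    # Naive heuristic: a token starting with an uppercase letter is a proper noun.
    tweet.pos_tags = ["NNP" if tok[:1].isupper() else "X" for tok in tweet.tokens]
    return tweet


def recognise_entities(tweet):
    # Every proper-noun token is treated as an entity mention.
    tweet.entities = [
        tok for tok, tag in zip(tweet.tokens, tweet.pos_tags) if tag == "NNP"
    ]
    return tweet


def link_entities(tweet):
    # Look each mention up in the toy knowledge base; None means "no link found".
    tweet.links = {mention: TOY_KB.get(mention) for mention in tweet.entities}
    return tweet


PIPELINE = [identify_language, tokenise, tag_pos, recognise_entities, link_entities]


def run_pipeline(text):
    tweet = Tweet(text)
    for stage in PIPELINE:
        tweet = stage(tweet)
    return tweet


if __name__ == "__main__":
    print(run_pipeline("heading to London tmrw #excited").links)
    print(run_pipeline("i love london").links)
```

In this toy version the lowercased `london` in the second call is never tagged as a proper noun, so no later stage can recognise or link it. Because the stages run consecutively, an error made early in the pipeline propagates to every stage after it.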
Journal: Information Processing & Management - Volume 51, Issue 2, March 2015, Pages 32–49