An approach on Chinese microblog entity linking combining baidu encyclopaedia and word2vec

Article ID	Journal	Published Year	Pages	File Type
4960866	Procedia Computer Science	2017	9 Pages	PDF

Abstract

Microblogs such as Twitter and Sina Weibo provide a convenient, instant platform that makes information easy to share and acquire. However, the short, noisy, real-time nature of microblog posts makes Chinese microblog entity linking a new challenge. In this paper, we survey a number of linking methods and describe our implementation for the Chinese microblog entity linking task. By crawling Baidu encyclopaedia web pages, we generate polysemous, synonymous and index collections in MongoDB to manage the entities. We use a Chinese NLP tool named HanLP to extract nouns, and then generate the candidate set from these collections and word similarity. For disambiguation, we use Word2vec, with a model trained on THUC news, to determine textual relevance. Our approach performs well on the Sina Weibo data set.
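
The pipeline summarized in the abstract (candidate generation from entity collections, followed by embedding-based disambiguation) could look roughly like the sketch below. It is only an illustration under stated assumptions, not the authors' implementation: the in-memory dictionaries `SYNONYMS` and `POLYSEMOUS` are hypothetical stand-ins for the MongoDB collections, the three-dimensional `VECTORS` table is a toy substitute for a trained Word2vec model, the averaged-vector cosine score is one common way to measure textual relevance rather than the paper's confirmed scoring, and noun extraction (done with HanLP in the paper) is assumed to have already produced the mention and its context words.

```python
import math

# Hypothetical stand-ins for the synonymous and polysemous collections.
SYNONYMS = {"BigApple": "Apple"}  # alias -> canonical entity name
POLYSEMOUS = {
    "Apple": [
        {"id": "apple_company", "context": ["phone", "company", "technology"]},
        {"id": "apple_fruit", "context": ["fruit", "tree", "sweet"]},
    ]
}

# Toy embeddings standing in for a trained Word2vec model (assumption).
VECTORS = {
    "phone": (1.0, 0.1, 0.0),
    "company": (0.9, 0.2, 0.1),
    "technology": (0.8, 0.0, 0.2),
    "fruit": (0.0, 1.0, 0.1),
    "tree": (0.1, 0.9, 0.0),
    "sweet": (0.0, 0.8, 0.3),
    "release": (0.9, 0.1, 0.1),
    "eat": (0.1, 0.9, 0.2),
}


def mean_vector(words):
    """Average the vectors of all known words; None if no word is known."""
    vecs = [VECTORS[w] for w in words if w in VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidates(mention):
    """Resolve an alias to its canonical name, then list its senses."""
    name = SYNONYMS.get(mention, mention)
    return POLYSEMOUS.get(name, [])


def link(mention, context_words):
    """Return the id of the candidate whose context best matches the post."""
    ctx = mean_vector(context_words)
    best_id, best_score = None, -1.0
    for cand in candidates(mention):
        cand_vec = mean_vector(cand["context"])
        if ctx is None or cand_vec is None:
            continue
        score = cosine(ctx, cand_vec)
        if score > best_score:
            best_id, best_score = cand["id"], score
    return best_id
```

Calling `link("Apple", ["phone", "release"])` scores each sense of the mention against the post's context words and returns the id of the closest one; a mention with no entry in either collection yields `None`.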

Keywords

word2vec; Entity linking