Article code    Journal code    Publication year    English article    Full-text version
4942738         1437416         2017                10 pages (PDF)     Free download
English title of the ISI article
Tackling topic general words in topic modeling
Persian translation of the title
مسائل کلیدی موضوع در مدل سازی موضوع (literally: "Key topic issues in topic modeling")
Related topics
Engineering and Basic Sciences; Computer Engineering; Artificial Intelligence
Abstract


- Study the problem of topic general words in topic modeling.
- Propose a metric, the generality score, to measure the generality of a word.
- Propose a new topic model, generality-sensitive LDA, that exploits generality scores in modeling.
- Propose a continuous learning approach that can use multiple domains to find topic general words.

Topic models are a prevailing tool for exploring latent topics in documents and for supporting many NLP tasks. To obtain good topics for a corpus, a preprocessing step is often needed to remove common stop words and to identify topic general words (TGWs) in the corpus. Such words can seriously harm topic formation because they create spurious co-occurrences of unrelated words. They are also likely to occupy the top positions of multiple topics and to cause many unrelated words to be grouped under one topic, which results in inscrutable and similar topics. In practice, one typically identifies and removes a list of TGWs from the corpus by hand. This is a time-consuming process and very hard for a lay user. In this paper, we aim to solve this problem automatically. The proposed approaches can be based on the current corpus alone or on multiple corpora. In the latter case, a novel continuous learning method is proposed that learns from past results on corpora from multiple domains to help identify TGWs in the current domain. We conduct experiments on two real-world datasets, and the experimental results show that the proposed approaches achieve superior results.
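
The symptom described in the abstract, a word occupying the top positions of several topics at once, can be illustrated with a small sketch. This is not the paper's generality score or generality-sensitive LDA, whose definitions are not given on this page; it is only a simple overlap heuristic. The function names, the `top_k` and `min_topics` parameters, and the toy topics are all assumptions made for illustration.

```python
from collections import Counter


def top_word_overlap(topics, top_k=10):
    """Count, for each word, how many topics list it among their top_k words.

    `topics` is a list of word lists, each ranked from most to least probable.
    """
    counts = Counter()
    for ranked_words in topics:
        # A set ensures a word is counted at most once per topic.
        counts.update(set(ranked_words[:top_k]))
    return counts


def candidate_general_words(topics, top_k=10, min_topics=2):
    """Return words that appear in the top_k list of at least min_topics topics.

    Illustrative heuristic only, not the metric proposed in the paper.
    """
    counts = top_word_overlap(topics, top_k)
    return sorted(word for word, n in counts.items() if n >= min_topics)


# Hypothetical topics from a product-review corpus (made-up data).
topics = [
    ["good", "battery", "life", "charge"],
    ["good", "screen", "display", "color"],
    ["price", "good", "cheap", "deal"],
]

print(candidate_general_words(topics, top_k=3))
```

In this toy input the word "good" ranks among the top three words of all three topics, so it is the only candidate flagged. A real pipeline would take the ranked word lists from a fitted topic model rather than from hand-written lists.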

Publisher
Database: Elsevier - ScienceDirect
Journal: Engineering Applications of Artificial Intelligence - Volume 62, June 2017, Pages 124-133