Article ID: 4943680
Journal: Expert Systems with Applications
Published Year: 2016
Pages: 18
File Type: PDF
Abstract
In this paper we show how a vector-based word representation obtained via word2vec can improve the results of a document classifier based on bags of words. Both models produce numeric representations of text, but in very different ways. The bag-of-words model represents documents as sparse vectors whose indices correspond to words or groups of words. word2vec generates word-level representations as much more compact vectors, whose indices implicitly encode information about the contexts in which words occur. Bags of words are very effective for document classification, and in our experiments no representation using only word2vec vectors improves on their results. However, this does not mean that the information provided by word2vec is useless for the classification task. When it is combined with bags of words, the results improve, which shows that the two sources of information are complementary and that word2vec contributes to the task. We have also performed cross-domain experiments, in which word2vec has shown much more stable behavior than bag-of-words models.
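The idea of combining the two representations can be illustrated with a minimal sketch. The abstract does not say how the paper combines them, so the choices here are assumptions made only for illustration: the word-level vectors are averaged into one document vector, and that vector is concatenated with the bag-of-words counts. The embeddings in `TOY_EMBEDDINGS` are hypothetical hand-written values standing in for trained word2vec vectors, and all function names are invented for this example.

```python
from collections import Counter

# Hypothetical 3-dimensional embeddings; real word2vec vectors would be
# learned from a corpus and typically have a few hundred dimensions.
TOY_EMBEDDINGS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "goal": [0.0, 0.9, 0.2],
    "match": [0.1, 0.8, 0.3],
}
EMBEDDING_DIM = 3


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocabulary(documents):
    """Sorted list of every distinct token; one bag-of-words index per token."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def bag_of_words(text, vocabulary):
    """Sparse count vector: one position per vocabulary word."""
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocabulary]


def mean_embedding(text, embeddings, dim):
    """Compact document vector: average of the vectors of the known words."""
    vectors = [embeddings[t] for t in tokenize(text) if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def combined_features(text, vocabulary, embeddings, dim):
    """Assumed combination scheme: concatenate both representations."""
    return bag_of_words(text, vocabulary) + mean_embedding(text, embeddings, dim)


docs = ["neural network training", "goal in the match"]
vocab = build_vocabulary(docs)
features = [combined_features(d, vocab, TOY_EMBEDDINGS, EMBEDDING_DIM) for d in docs]
```

Each row of `features` has one entry per vocabulary word followed by the averaged embedding, and could be passed to any standard classifier. Other combination schemes, such as weighting the word vectors or merging the outputs of separate classifiers, fit the abstract equally well.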
Related Topics
Physical Sciences and Engineering > Computer Science > Artificial Intelligence