Article ID | Journal ID | Publication year | English article | Full text |
---|---|---|---|---|
402589 | 676968 | 2015 | 14-page PDF | Free download |
The Author Profiling (AP) task aims to reveal as much information as possible about an author (e.g., age, gender, etc.) from a given document. AP is crucial for several applications, ranging from customized advertising to computer forensics, psychology, and entertainment. Nonetheless, the AP task is far from being solved, particularly in social media domains, where the nature of the documents hinders the applicability of state-of-the-art text mining tools (e.g., because of spelling and grammar errors, huge vocabularies, and the presence of many out-of-vocabulary terms). Currently, most of the work on AP for social media has been devoted to the development of descriptive features, which are used under standard representations, such as the Bag-of-Words (BoW). Nevertheless, BoW-like representations have some well-known shortcomings, namely: (i) the sparsity and high dimensionality of the representation, and (ii) the failure to capture relationships, other than mere occurrence, among terms. This paper focuses on the study of alternative document representations that can deal with such issues. We propose a representation for documents that captures discriminative and subprofile-specific information of terms. Under the proposed representation, terms are represented in a vector space that captures discriminative information. Then, term representations are aggregated to represent the content of a document. In this manner, documents are represented in a low-dimensional (and discriminative) space which is non-sparse. We evaluate the effectiveness of the proposed representation on several corpora from the social media domain. The proposed representation is compared to the standard BoW representation and a wide variety of state-of-the-art AP approaches. Experimental results reveal that the proposed representation outperforms most of the reference methodologies. Furthermore, we show that the proposed representation is in agreement with previous studies on handcrafted attributes for AP.
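
The two-step idea described above (represent each term in a small discriminative space, then aggregate term vectors into a document vector) can be illustrated with a minimal Python sketch. The abstract does not specify the weighting or aggregation scheme, so everything here is an assumption made for illustration: the function names `term_vectors` and `document_vector` are hypothetical, term weights are plain per-profile relative frequencies, aggregation is a simple average, and subprofiles are not modeled (one could pass subprofile identifiers as labels instead of profile names).

```python
from collections import defaultdict


def term_vectors(docs, labels):
    """Map each term to a vector with one weight per profile label.

    Assumption: the weight is the share of the term's occurrences that
    fall in each profile. The paper's actual weighting may differ.
    """
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    counts = defaultdict(lambda: [0.0] * len(classes))
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][index[label]] += 1.0
    vectors = {}
    for tok, vec in counts.items():
        total = sum(vec)
        # Normalize so the entries of each term vector sum to 1.
        vectors[tok] = [v / total for v in vec]
    return classes, vectors


def document_vector(tokens, vectors, n_classes):
    """Average the vectors of the known terms in a document.

    The result has n_classes dimensions regardless of vocabulary size,
    which is what makes the representation low-dimensional and dense.
    """
    acc = [0.0] * n_classes
    known = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue  # out-of-vocabulary term: ignored in this sketch
        known += 1
        for i, v in enumerate(vec):
            acc[i] += v
    if known == 0:
        return acc  # no known terms: all-zero vector
    return [a / known for a in acc]


# Toy, made-up training data (tokenized documents and profile labels).
docs = [["love", "shopping"], ["football", "beer"], ["love", "football"]]
labels = ["female", "male", "male"]

classes, vectors = term_vectors(docs, labels)
doc_vec = document_vector(["love", "shopping", "unknown"], vectors, len(classes))
print(classes, doc_vec)
```

With a BoW representation the toy documents would live in a space with one dimension per vocabulary term; here every document is mapped to as many dimensions as there are profiles, and the resulting vector could be fed to any standard classifier.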
Journal: Knowledge-Based Systems - Volume 89, November 2015, Pages 134–147