Article ID: 6940518
Journal: Pattern Recognition Letters
Published Year: 2018
Pages: 12
File Type: PDF
Abstract
Author and author gender identification are two major tasks in the profiling of authors of written material. Author identification (or, more precisely, “authorship attribution”) deals with assigning a piece of written material to its author, who is chosen from a given list of author names. Gender identification concerns the prediction of the gender of the author (male vs. female). Both tasks are highly relevant to a number of applications, including, e.g., plagiarism and deception detection, document authenticity verification, and blackmailing. The state of the art in both fields tends to rely mainly on lexical and token (sequence) distribution features. However, this neglects numerous linguistic studies that clearly indicate the high relevance of “deep linguistic”, i.e., syntactic and discourse, features to the characterization of the style of an author or a group of authors. Our work on author and gender identification confirms this relevance. We show with two different genres, namely blog posts and literary writings, that the use of deep linguistic features is very effective. It leads to an accuracy of > 78% (blog posts) and > 91% (literary writings) in author identification, and of > 89% (blog posts) and > 90% (literary writings) in gender identification.
Related Topics
Physical Sciences and Engineering > Computer Science > Computer Vision and Pattern Recognition