• We propose a novel approach based on Formal Concept Analysis for Topic Detection.
• Our proposal overcomes traditional problems of clustering and classification techniques.
• We analyse the parameters involved in the process in a Twitter-based framework.
• We propose a topic selection methodology based on the stability concept.
• We improve on the state-of-the-art results for the task.
Topic Detection in Twitter is an indispensable step in the analysis of text corpora and their later application in Online Reputation Management. Classification, clustering and probabilistic techniques have traditionally been applied, but they have well-known drawbacks, such as the need to fix the number of topics to be detected and the difficulty of integrating prior knowledge of topics with the detection of new ones. This motivates the current work, in which we present a novel approach based on Formal Concept Analysis (FCA), a fully unsupervised methodology that groups similar content into thematically based topics (i.e., the FCA formal concepts) and organizes them in a concept lattice. Formal concepts are conceptual representations based on the relationships between tweet terms and the tweets that give rise to them. Unlike other approaches in the literature, this makes the detected topics clearly interpretable. In addition, the concept lattice is a formalism that describes the data and explores correlations, similarities, anomalies and inconsistencies better than other representations such as clustering models or graph-based representations. Our rationale is that these theoretical advantages may improve the Topic Detection process, enabling it to tackle the problems related to the task. To test this, our FCA-based proposal is evaluated on a real-life Topic Detection task provided by the Replab 2013 CLEF Campaign. To demonstrate the effectiveness of the proposal, we carried out several experiments testing: (a) the impact of terminology selection as the input to our algorithm; (b) the impact of concept selection as the outcome of our algorithm; and (c) the ability of the proposal to detect new and previously unseen topics (i.e., topic adaptation).
An extensive analysis of the results shows that our proposal can integrate prior knowledge of topics without losing the ability to detect novel and unseen topics, and that it improves on the best Replab 2013 results.
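To make the notion of a formal concept more concrete, the following minimal Python sketch enumerates the formal concepts of a tiny tweet-term context and scores one of them by stability. It is only an illustration under stated assumptions: the four tweets and their terms are invented, the brute-force enumeration is not the algorithm used in the paper, and the stability measure is assumed to be the usual intensional stability from the FCA literature (the fraction of subsets of a concept's tweets that share exactly the concept's terms), which the abstract itself does not define.

```python
from itertools import combinations

# Toy formal context (hypothetical data): tweet id -> set of terms it contains.
CONTEXT = {
    "t1": {"bank", "loan"},
    "t2": {"bank", "loan", "rate"},
    "t3": {"bank", "strike"},
    "t4": {"strike", "union"},
}
ALL_TERMS = frozenset().union(*CONTEXT.values())


def common_terms(tweets):
    """Derivation operator: the terms shared by every tweet in `tweets`."""
    shared = set(ALL_TERMS)
    for tweet in tweets:
        shared &= CONTEXT[tweet]
    return frozenset(shared)


def tweets_with(terms):
    """Derivation operator: the tweets that contain every term in `terms`."""
    return frozenset(t for t, ts in CONTEXT.items() if terms <= ts)


def formal_concepts():
    """Brute-force enumeration: close every subset of tweets.

    Each concept is a pair (extent, intent) where the extent is exactly the
    set of tweets sharing the intent, and the intent is exactly the set of
    terms shared by the extent. Exponential in the number of tweets, so it
    is only suitable for toy examples.
    """
    concepts = set()
    tweets = list(CONTEXT)
    for size in range(len(tweets) + 1):
        for subset in combinations(tweets, size):
            intent = common_terms(subset)
            extent = tweets_with(intent)
            concepts.add((extent, intent))
    return concepts


def stability(extent, intent):
    """Assumed definition (intensional stability): the share of subsets of
    the extent whose common terms are exactly the concept's intent."""
    subsets = [
        s
        for size in range(len(extent) + 1)
        for s in combinations(sorted(extent), size)
    ]
    hits = sum(1 for s in subsets if common_terms(s) == intent)
    return hits / len(subsets)


if __name__ == "__main__":
    for extent, intent in sorted(formal_concepts(), key=lambda c: len(c[0])):
        print(sorted(extent), sorted(intent), stability(extent, intent))
```

In this toy context, the tweets t1, t2 and t3 form a concept whose only shared term is "bank", which can be read as a candidate topic; a higher stability value suggests that the topic depends less on any individual tweet.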
Journal: Expert Systems with Applications - Volume 57, 15 September 2016, Pages 21–36