A multi-document summarization system based on statistics and linguistic treatment

Article ID	Journal	Published Year	Pages	File Type
382888	Expert Systems with Applications	2014	8 Pages	PDF

Abstract

• The paper proposes a multi-document summarization system that uses statistical and linguistic treatment.
• We create a new sentence clustering algorithm to deal with the redundancy and information diversity problems.
• We introduce a new graph model to explore syntactic, semantic, co-reference and discourse relations between sentences.

The quantity of data available on the Internet today has reached such a volume that it has become humanly unfeasible to sieve useful information from it efficiently. Text summarization techniques offer one solution to this problem. Text summarization, the process of automatically creating a shorter version of one or more text documents, is an important way of finding relevant information in large text libraries or on the Internet. This paper presents a multi-document summarization system that concisely extracts the main aspects of a set of documents while trying to avoid the typical problems of this type of summarization: information redundancy and diversity. This is achieved through a new sentence clustering algorithm based on a graph model that makes use of statistical similarities and linguistic treatment. The proposed system was evaluated on the DUC 2002 dataset, where it surpassed the DUC competitors by a margin of 50% in F-measure in the best case.
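The general idea of graph-based sentence clustering for extractive summarization can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper: it assumes a purely statistical similarity (cosine similarity over bag-of-words counts), treats connected components of the similarity graph as clusters, and picks the earliest sentence of each cluster as its representative. The function names, the `threshold` parameter and the `max_sentences` parameter are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_sentences(sentences, threshold=0.5):
    """Group sentence indices into clusters.

    Assumption: an edge links two sentences whose cosine similarity is at
    least `threshold`; each connected component of that graph is a cluster.
    """
    vectors = [Counter(tokenize(s)) for s in sentences]
    n = len(sentences)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)

    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


def summarize(sentences, threshold=0.5, max_sentences=2):
    """Return one representative sentence from each of the largest clusters."""
    clusters = cluster_sentences(sentences, threshold)
    # Larger clusters first; ties are broken by the earliest sentence.
    clusters.sort(key=lambda c: (-len(c), c[0]))
    picked = [c[0] for c in clusters[:max_sentences]]
    # Keep the selected sentences in their original order.
    return [sentences[i] for i in sorted(picked)]
```

Taking one sentence per cluster is what limits redundancy in this sketch, and drawing from several clusters is what preserves diversity. A system such as the one described here would replace the bag-of-words similarity with a graph that also encodes linguistic relations.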

Keywords

Sentence clustering; Multi-document summarization; Extractive summarization