Article ID Journal Published Year Pages File Type
557912 Computer Speech & Language 2012 15 Pages PDF
Abstract

Transcript-based topic segmentation of TV programs faces several difficulties arising from transcription errors, from the presence of potentially short segments, and from the limited number of word repetitions available to enforce lexical cohesion, i.e., the lexical relations that exist within a text and give it a certain unity. To overcome these problems, we extend a probabilistic measure of lexical cohesion based on generalized probabilities under a unigram language model. On the one hand, confidence measures and semantic relations are considered as additional sources of information. On the other hand, language model interpolation techniques are investigated for better language model estimation. Experimental topic segmentation results are presented on two corpora with distinct characteristics, composed respectively of broadcast news and reports on current affairs. Significant improvements are obtained on both corpora, demonstrating the effectiveness of the extended lexical cohesion measure for spoken TV content, as well as its genericity across different programs.
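The core idea described above can be sketched in a few lines: score a candidate segment by its generalized probability under a unigram language model estimated from the segment itself, interpolated with a background model (the interpolation step mentioned in the abstract). This is a minimal illustrative sketch, not the paper's actual implementation; the function name, the interpolation weight, and the length normalization are assumptions.

```python
from collections import Counter
import math

def interpolated_unigram_logprob(segment, background_counts, background_total, lam=0.7):
    """Length-normalized log-probability of a segment under a unigram LM
    that interpolates the segment's own word counts with a background
    (general-language) model. Repeated words reinforce their own segment-level
    probability, so lexically cohesive segments score higher (less negative).

    NOTE: illustrative sketch only; lam=0.7 is an arbitrary assumption.
    """
    counts = Counter(segment)
    total = len(segment)
    logprob = 0.0
    for w in segment:
        p_seg = counts[w] / total                     # within-segment unigram estimate
        p_bg = (background_counts.get(w, 0) / background_total
                if background_total else 0.0)         # background unigram estimate
        logprob += math.log(lam * p_seg + (1 - lam) * p_bg)
    return logprob / total
```

A segmenter would compare such scores across candidate boundary placements; a segment whose vocabulary keeps repeating (high cohesion) yields a higher normalized score than one mixing many unrelated words.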

► Adaptation of lexical cohesion-based topic segmentation to the specifics of TV programs. ► Confidence measures and semantic relations used as additional sources of information. ► Language model interpolation techniques used for better language model estimation. ► Domain-independent technique applied to two corpora composed of TV news and reports. ► F1-measure improved by +4.9 and +3.7 on the two corpora.

Related Topics
Physical Sciences and Engineering › Computer Science › Signal Processing
Authors