Article ID Journal Published Year Pages File Type
557912 Computer Speech & Language 2012 15 Pages PDF
Abstract

Transcript-based topic segmentation of TV programs faces several difficulties arising from transcription errors, from the presence of potentially short segments, and from the limited number of word repetitions available to enforce lexical cohesion, i.e., the lexical relations that exist within a text and give it a certain unity. To overcome these problems, we extend a probabilistic measure of lexical cohesion based on generalized probabilities under a unigram language model. On the one hand, confidence measures and semantic relations are considered as additional sources of information. On the other hand, language model interpolation techniques are investigated for better language model estimation. Experimental topic segmentation results are presented on two corpora with distinct characteristics, composed respectively of broadcast news and reports on current affairs. Significant improvements are obtained on both corpora, demonstrating the effectiveness of the extended lexical cohesion measure for spoken TV content, as well as its genericity across different programs.
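The core idea described above can be sketched in a few lines: score a candidate segment by its generalized probability under a unigram language model estimated from the segment itself, interpolated with a background model (the interpolation step mentioned in the abstract). This is a minimal illustrative sketch, not the paper's actual implementation; the function name, the interpolation weight, and the length normalization are assumptions.

```python
from collections import Counter
import math

def interpolated_unigram_logprob(segment, background_counts, background_total, lam=0.7):
    """Length-normalized log-probability of a segment under a unigram LM
    that interpolates the segment's own word counts with a background
    (general-language) model. Repeated words reinforce their own segment-level
    probability, so lexically cohesive segments score higher (less negative).

    NOTE: illustrative sketch only; lam=0.7 is an arbitrary assumption.
    """
    counts = Counter(segment)
    total = len(segment)
    logprob = 0.0
    for w in segment:
        p_seg = counts[w] / total                     # within-segment unigram estimate
        p_bg = (background_counts.get(w, 0) / background_total
                if background_total else 0.0)         # background unigram estimate
        logprob += math.log(lam * p_seg + (1 - lam) * p_bg)
    return logprob / total
```

A segmenter would compare such scores across candidate boundary placements; a segment whose vocabulary keeps repeating (high cohesion) yields a higher normalized score than one mixing many unrelated words.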

► Adaptation of lexical cohesion-based topic segmentation to the specifics of TV programs. ► Confidence measures and semantic relations used as additional sources of information. ► Language model interpolation techniques used for better language model estimation. ► Domain-independent technique applied to two corpora composed of TV news and reports. ► F1-measure improved by +4.9 and +3.7 on the two corpora.

Related Topics
Physical Sciences and Engineering › Computer Science › Signal Processing
Authors