Article code, Journal code, Publication year, English article, Full-text version
394223, 665785, 2011, 19 pages (PDF), Free download
English title of the ISI article
On the effectiveness of subwords for lexical cohesion based story segmentation of Chinese broadcast news
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
English abstract

Story segmentation divides a multimedia stream into homogeneous regions, each addressing a central topic. Lexical cohesion is a reasonable indicator of story boundaries. However, for story segmentation of Chinese broadcast news, directly measuring word-level lexical cohesion is not applicable, because the text transcribed from audio is highly unreliable and the inevitable speech recognition errors may significantly break word cohesion, thus heavily degrading segmentation performance. To address this problem, we propose using subword-level cohesion in story segmentation of Chinese broadcast news, because Chinese subwords play important semantic roles and are robust to speech recognition errors. We provide a comprehensive study of the effectiveness of subword units in story segmentation of Chinese speech recognition transcripts, and analyze the influence of recognition errors on segmentation performance. Specifically, we study subword-based TextTiling and lexical chaining approaches to story segmentation, in which lexical cohesion is measured using either character or syllable n-grams (n = 1, 2, 3, 4). Our extensive experiments demonstrate that subword unigrams and bigrams improve performance over word-based methods. For instance, tested on the CCTV corpus, character unigram lexical chaining obtains a relative F1-measure gain of 12% over words on erroneous brief news transcripts (with a word error rate of 40.9%). Generally, we find that subword-based methods can often obtain better segmentation than word-based ones for both error-free and erroneous transcripts.
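
To make the idea of subword-level cohesion concrete, the following is a minimal Python sketch of a TextTiling-style similarity curve computed over character n-grams. It is an illustration only, not the authors' implementation: the function names (`char_ngrams`, `cosine`, `cohesion_scores`), the `window` parameter, and the toy sentences are assumptions introduced here, and the sketch omits the smoothing and depth scoring that a full TextTiling system would normally apply.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Return overlapping character n-grams of `text`, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cohesion_scores(sentences, n=1, window=2):
    """Score each gap between sentences by the similarity of the
    `window` sentences before it and the `window` sentences after it.
    A low score suggests a candidate story boundary (hypothetical helper)."""
    grams = [Counter(char_ngrams(s, n)) for s in sentences]
    scores = []
    for gap in range(1, len(sentences)):
        left = sum(grams[max(0, gap - window):gap], Counter())
        right = sum(grams[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores


# Toy transcript: two sentences about the stock market, two about a typhoon.
sentences = ["股市上涨", "股市下跌", "台风登陆", "台风减弱"]
scores = cohesion_scores(sentences, n=1, window=2)
boundary = scores.index(min(scores)) + 1  # boundary falls before this sentence
```

In this toy case the lowest-scoring gap lies between the second and third sentences, where the two blocks share no characters. Because matching is done on characters rather than recognized words, a misrecognized word that still shares some characters with the correct one can continue to contribute to the similarity, which is the intuition behind the robustness claim in the abstract.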

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Sciences - Volume 181, Issue 13, 1 July 2011, Pages 2873–2891