Integrating unsupervised and supervised word segmentation: The role of goodness measures

Article code	Journal code	Publication year	English article	Full-text version
394348	665793	2011	21-page PDF	Free download

Keywords

Unsupervised segmentation; Chinese word segmentation; Conditional random fields

Related topics

Engineering and basic sciences; Computer engineering; Artificial intelligence

Abstract

This study explores the feasibility of integrating unsupervised and supervised segmentation of Chinese texts for enhancing performance beyond the present state of the art, focusing on the critical role of the former in enhancing the latter. Following only a pre-defined goodness measure, unsupervised segmentation has the advantage of discovering many new words in raw texts, but it has the disadvantage of inevitably corrupting many known ones. By contrast, supervised segmentation conventionally trained only on a pre-segmented corpus is particularly good at identifying known words but possesses little intrinsic mechanism to deal with unseen ones until it is formulated as character tagging. To combine their strengths, we empirically evaluate a set of goodness measures, among which description length gain excels in word discovery, but simple strategies like word candidate pruning and assemble segmentation can further improve it. Interestingly, however, accessor variety and boundary entropy, two other goodness measures, are found to be more effective in enhancing the supervised learning of character tagging with the conditional random fields model. All goodness scores are discretized into feature values to enrich this model. The success of this approach has been verified by our experiments on the benchmark data sets of the last two Bakeoffs: on average, it achieves an error reduction of 6.39% over the best performance of closed test in Bakeoff-3 and ranks first in all five closed test tracks in Bakeoff-4, outperforming other participants significantly and consistently by an error reduction of 8.96%.
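
The abstract names accessor variety as a goodness measure and says the scores are discretized into feature values, but it does not spell out either step. The following is a minimal sketch, assuming the commonly used definition of accessor variety (the smaller of the number of distinct characters seen immediately before and immediately after a substring). The function names, the `max_len` parameter, the single `<S>`/`<E>` boundary markers, and the logarithmic bucketing are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
import math


def accessor_variety(text, max_len=2):
    """Map each substring (up to max_len chars) to its accessor variety.

    Assumed definition: AV(s) = min(distinct left neighbours,
    distinct right neighbours). Text boundaries are simplified to one
    shared marker each ("<S>" and "<E>"), which is an assumption.
    """
    left = defaultdict(set)
    right = defaultdict(set)
    n = len(text)
    for i in range(n):
        for k in range(1, max_len + 1):
            j = i + k
            if j > n:
                break
            s = text[i:j]
            left[s].add(text[i - 1] if i > 0 else "<S>")
            right[s].add(text[j] if j < n else "<E>")
    return {s: min(len(left[s]), len(right[s])) for s in left}


def discretize(score):
    """Bucket a non-negative score on a log2 scale (hypothetical scheme)."""
    if score < 1:
        return 0
    return 1 + math.floor(math.log2(score))


av = accessor_variety("abcabdabe")
features = {s: discretize(v) for s, v in av.items()}
```

In this toy string, "ab" is preceded and followed by three different contexts, so it scores higher than "a" or "b" alone, which is the intuition behind using the measure to spot word-like units. The bucketed values in `features` stand in for the discrete feature values that would be passed to a character tagger.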

Publisher

Database: Elsevier - ScienceDirect
Journal: Information Sciences - Volume 181, Issue 1, 1 January 2011, Pages 163–183

Authors

Hai Zhao, Chunyu Kit
