Article code: 394955
Journal code: 665919
Publication year: 2008
English article: 15-page PDF
Full-text version: Free download
English title of the ISI article
Chinese word segmentation as morpheme-based lexical chunking
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
English abstract

Chinese word segmentation plays an important role in many Chinese language processing tasks such as information retrieval and text mining. Recent research in Chinese word segmentation focuses on tagging approaches with either characters or words as tagging units. In this paper we present a morpheme-based chunking approach and implement it in a two-stage system. It consists of two main components, namely a morpheme segmentation component to segment an input sentence to a sequence of morphemes based on morpheme-formation models and bigram language models, and a lexical chunking component to label each segmented morpheme’s position in a word of a special type with the aid of lexicalized hidden Markov models. To facilitate these tasks, a statistically-based technique is also developed for automatically compiling a morpheme dictionary from a segmented or tagged corpus. To evaluate this approach, we conduct a closed test and an open test using the 2005 SIGHAN Bakeoff data. Our system demonstrates state-of-the-art performance on different test sets, showing the benefits of choosing morphemes as tagging units. Furthermore, the open test results indicate significant performance enhancement using lexicalization and part-of-speech features.
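
The lexical chunking idea in the abstract, labeling each segmented morpheme with its position in a word, can be illustrated with a minimal sketch. The abstract does not give the tag set, so the four tags used here (B, M, E, S) and the function names `words_to_tags` and `tags_to_words` are assumptions made for illustration only. The sketch also leaves out the statistical parts of the system (the morpheme-formation models, bigram language models, and lexicalized hidden Markov models) and shows only how position tags encode word boundaries.

```python
from typing import List, Tuple

# Hypothetical tag set (an assumption, not taken from the paper):
# B = word-initial morpheme, M = word-internal, E = word-final,
# S = a word made of a single morpheme.

def words_to_tags(words: List[List[str]]) -> List[Tuple[str, str]]:
    """Turn words (each given as a list of morphemes) into (morpheme, tag) pairs."""
    tagged = []
    for morphemes in words:
        if len(morphemes) == 1:
            tagged.append((morphemes[0], "S"))
            continue
        last = len(morphemes) - 1
        for i, morpheme in enumerate(morphemes):
            tag = "B" if i == 0 else "E" if i == last else "M"
            tagged.append((morpheme, tag))
    return tagged

def tags_to_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Rebuild words from (morpheme, tag) pairs; a word closes on E or S."""
    words, current = [], []
    for morpheme, tag in tagged:
        current.append(morpheme)
        if tag in ("E", "S"):
            words.append("".join(current))
            current = []
    if current:  # tolerate a tag sequence that ends in the middle of a word
        words.append("".join(current))
    return words

# Toy input: two words, the second one built from two morphemes.
toy_words = [["中国"], ["科学", "家"]]
toy_tagged = words_to_tags(toy_words)
toy_rebuilt = tags_to_words(toy_tagged)
```

In a full system the tags would be predicted by a trained sequence model, not derived from an already segmented sentence; the round trip above only shows that a tag sequence of this kind is enough to recover the word boundaries.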

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Sciences - Volume 178, Issue 9, 1 May 2008, Pages 2282–2296