Free-gram phrase identification for modeling Chinese text

Article code	Journal code	Publication year	English article
427544	686518	2013	8-page PDF


Keywords

Information retrieval; Text categorization; Sparse coding

Related topics

Engineering and Basic Sciences, Computer Engineering, Computational Theory and Mathematics


Abstract

A vector space model built on a bag of phrases plays an important role in modeling Chinese text. However, the conventional way of identifying free-length phrases, fixed-gram scanning, is costly. To address this problem, we propose a novel approach to key phrase identification that can identify phrases of all lengths and thus improves the coding efficiency and discrimination of the data representation. In the proposed method, we first convert each document into a context graph, a directed graph that encapsulates the statistical and positional information of all the 2-word strings in the document. We treat every transmission path in the graph as a hypothesis for a phrase, and select the corresponding phrase as a candidate if the hypothesis is valid in the original document. Finally, we selectively divide some of the complex candidate phrases into sub-phrases to improve the coding efficiency, resulting in a set of phrases for codebook construction. Experiments on both balanced and unbalanced datasets show that the codebooks generated by our approach are more efficient than those generated by conventional methods (one syntactical method and three statistical methods are investigated). Furthermore, the data representation created by our approach has demonstrated higher discrimination than those created by conventional methods in a classification task.
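
The abstract does not give the construction in detail, so the following is only a minimal sketch of the graph-and-hypothesis idea under stated assumptions. The function names (`build_context_graph`, `candidate_phrases`), the `min_count` frequency threshold, and the `max_len` cap are hypothetical and do not come from the paper. The input is assumed to be an already tokenized document, and the final step of dividing complex candidates into sub-phrases is left out.

```python
def build_context_graph(tokens):
    """Directed graph over tokens: graph[a][b] lists every position
    at which the 2-word string (a, b) starts in the document."""
    graph = {}
    for i in range(len(tokens) - 1):
        graph.setdefault(tokens[i], {}).setdefault(tokens[i + 1], []).append(i)
    return graph


def candidate_phrases(tokens, min_count=2, max_len=6):
    """Treat each path in the graph as a phrase hypothesis and keep it
    only if it occurs contiguously in the document at least min_count
    times. min_count and max_len are illustrative assumptions."""
    graph = build_context_graph(tokens)
    phrases = {}
    for a, successors in graph.items():
        for b, positions in successors.items():
            if len(positions) < min_count:
                continue
            stack = [((a, b), positions)]
            while stack:
                path, starts = stack.pop()
                phrases[path] = len(starts)
                if len(path) >= max_len:
                    continue
                # Extend the path by one edge; an occurrence survives only
                # if that edge starts exactly where the current path ends.
                for nxt, edge_positions in graph.get(path[-1], {}).items():
                    edge_set = set(edge_positions)
                    kept = [s for s in starts if s + len(path) - 1 in edge_set]
                    if len(kept) >= min_count:
                        stack.append((path + (nxt,), kept))
    return phrases


if __name__ == "__main__":
    doc = "a b c x a b c y a b".split()
    for phrase, count in sorted(candidate_phrases(doc).items()):
        print(" ".join(phrase), count)
```

Storing start positions on each edge is what lets a path be checked against the original document without rescanning it: a path through the graph that never occurs contiguously in the text loses all of its start positions and is dropped.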

► We present an unsupervised method for extracting free-length Chinese phrases that requires no prior knowledge.
► Syntactical and statistical features are both incorporated into the formation of phrases.
► Syntactical information helps form a sparse codebook (phrases), while statistical features help improve representativeness for a specific document.
► The experimental results show that our method significantly outperforms baseline representation models in coding efficiency and discriminativeness.

Publisher

Database: Elsevier - ScienceDirect
Journal: Information Processing Letters - Volume 113, Issue 4, 28 February 2013, Pages 137–144

Authors

Xi Peng, Zhang Yi, Xiao-Yong Wei, De-Zhong Peng, Yong-Sheng Sang
