Article code | Journal code | Publication year | English article | Full-text version
402894 | 677025 | 2011 | 7-page PDF | Free download
English title of the ISI article
Domain phrase identification using atomic word formation in Chinese text
Related topics
Engineering and Basic Sciences > Computer Engineering > Artificial Intelligence
Abstract

Chinese word segmentation is challenging because Chinese text has no white space to mark word boundaries, and the result depends largely on the quality of the segmentation dictionary. Many domain phrases are cut into single words because they are not contained in the general dictionary. This paper presents a Chinese domain phrase identification algorithm based on atomic word formation. First, after the corpus is preprocessed, an atomic word formation algorithm extracts candidate strings, which are stored as the candidate domain phrase set. Second, several strategies, including repeated substring screening, part-of-speech (POS) combination filtering, and prefix and suffix filtering, are used to filter the candidate domain phrases. Third, a domain phrase refining method decides whether a string is a domain phrase by calculating its domain relevance. Finally, the identified strings are sorted and exported to users. With the help of morphological rules, the method combines statistical information with rules instead of relying on corpus-based machine learning. Experiments show that the method obtains better results than traditional n-gram methods.
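
The first two stages can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define atomic word formation or the filters in detail, so the sketch assumes that candidates are runs of adjacent atomic words from an already segmented corpus, that repeated substring screening drops a candidate whenever a longer candidate containing it has the same frequency, and that prefix and suffix filtering uses simple stop lists. POS combination filtering is omitted because it needs a tagger. The names `extract_candidates`, `filter_candidates`, `max_len`, and `min_freq` are hypothetical.

```python
from collections import Counter


def extract_candidates(sentences, max_len=3):
    """Count every run of 2..max_len adjacent atomic words (assumed definition)."""
    counts = Counter()
    for words in sentences:
        for start in range(len(words)):
            last = min(start + max_len, len(words))
            for end in range(start + 2, last + 1):
                counts[tuple(words[start:end])] += 1
    return counts


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def filter_candidates(counts, min_freq=2, bad_prefixes=(), bad_suffixes=()):
    """Apply frequency, prefix/suffix, and repeated-substring filters (assumed rules)."""
    frequent = {c: f for c, f in counts.items() if f >= min_freq}
    kept = {}
    for cand, freq in frequent.items():
        # Prefix and suffix filtering: reject candidates that start or end
        # with a word from the stop lists.
        if cand[0] in bad_prefixes or cand[-1] in bad_suffixes:
            continue
        # Repeated substring screening: a candidate that never occurs outside
        # a longer candidate (same frequency) is treated as a fragment.
        absorbed = any(
            len(other) > len(cand) and f == freq and contains(other, cand)
            for other, f in frequent.items()
        )
        if not absorbed:
            kept[cand] = freq
    return kept


# Toy corpus, already segmented into atomic words.
sentences = [
    ["支持", "向量", "机", "的", "训练"],
    ["支持", "向量", "机", "的", "应用"],
    ["使用", "支持", "向量", "机"],
]
phrases = filter_candidates(
    extract_candidates(sentences),
    min_freq=2,
    bad_prefixes={"的"},
    bad_suffixes={"的"},
)
print(phrases)
```

On this toy corpus only the three-word candidate 支持 向量 机 ("support vector machine") survives, with frequency 3; its two-word fragments are screened out because they never occur outside it.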


► We study an atomic word formation algorithm to extract domain phrases from a corpus.
► Filtering strategies are used to remove non-domain phrases.
► Domain refining is used to calculate the domain relevance of each candidate string.
► The method achieves better precision and efficiency than the n-gram algorithm.
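
The domain refining step can be sketched as follows. The abstract does not give the relevance formula, so this sketch assumes one common choice: the relative frequency of a string in the domain corpus divided by the sum of its relative frequencies in the domain corpus and in a general reference corpus. The function names, the threshold of 0.5, and all counts below are hypothetical illustration values, not data from the paper.

```python
def domain_relevance(domain_freq, domain_total, general_freq, general_total):
    """Assumed score in [0, 1]: share of relative frequency that comes from the domain corpus."""
    p_domain = domain_freq / domain_total
    p_general = general_freq / general_total
    if p_domain + p_general == 0:
        return 0.0
    return p_domain / (p_domain + p_general)


def refine(candidates, general_counts, domain_total, general_total, threshold=0.5):
    """Keep candidates whose relevance exceeds the threshold, sorted best first."""
    scored = []
    for phrase, freq in candidates.items():
        score = domain_relevance(
            freq, domain_total, general_counts.get(phrase, 0), general_total
        )
        if score > threshold:
            scored.append((phrase, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up counts: frequency in the domain corpus and in a general corpus.
candidates = {"支持向量机": 30, "我们认为": 20, "核函数": 10}
general_counts = {"我们认为": 400, "核函数": 1}
ranked = refine(candidates, general_counts, domain_total=1000, general_total=10000)
for phrase, score in ranked:
    print(phrase, round(score, 3))
```

With these counts, 我们认为 ("we believe") is rejected because it is relatively more frequent in the general corpus, while 支持向量机 and 核函数 ("kernel function") are kept and ranked in that order.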

Publisher
Database: Elsevier - ScienceDirect
Journal: Knowledge-Based Systems - Volume 24, Issue 8, December 2011, Pages 1254–1260