Article code: 534775
Journal code: 870288
Publication year: 2012
English article: 9-page PDF
Full-text version: Free download
English title of the ISI article
Improving Korean verb–verb morphological disambiguation using lexical knowledge from unambiguous unlabeled data and selective web counts
Related topics
Engineering and Basic Sciences > Computer Engineering > Computer Vision and Pattern Recognition
English abstract

This paper addresses verb–verb morphological disambiguation, the case in which two different verbs share the same inflected form. Verb–verb morphological ambiguity (VVMA) is one of the critical issues in Korean part-of-speech (POS) tagging. Recognizing the base form of an ambiguous verb depends heavily on the lexical information in its surrounding context and on the domain in which it occurs. However, current probabilistic morpheme-based POS tagging systems cannot handle VVMA adequately: most of them are limited in how much word-level context they can reflect, and they are trained on too little labeled data to represent the lexical information that VVMA disambiguation requires.

In this study, we propose a classifier based on a large pool of raw text that contains sufficient lexical information to handle VVMA. The underlying idea is to generate an annotated training set for an ambiguity problem such as VVMA resolution automatically, using unlabeled unambiguous instances that belong to the same class. This makes it possible to label ambiguous instances with knowledge induced from unambiguous ones. Because an unambiguous instance has only one possible label, its annotated corpus can be generated automatically from unlabeled data.

In our problem, not all conjugations of irregular verbs lead to the spelling changes that cause VVMA, so training data for VVMA disambiguation are generated from instances of the unambiguous conjugations of each possible base form of an ambiguous word. This approach requires neither an additional annotation step for an initial training set nor a procedure for selecting good seeds to augment the labeled set iteratively, both of which are important issues in bootstrapping methods that use unlabeled data. This is an advantage over previous related work using unlabeled data. Furthermore, the approach ensures a large supply of confident seeds that are unambiguous and provide enough coverage for learning.

We also propose a strategy that extends the context information incrementally with web counts, applied only to selected test examples that are difficult to predict with the current classifier or that differ greatly from the pre-trained data set.

As a result, automatic data generation and knowledge acquisition from unlabeled text for VVMA resolution improved the overall token-level tagging accuracy by 0.04%. In practice, 9–10% of verb-related tagging errors are fixed by VVMA resolution, whose accuracy was about 98% using the Naïve Bayes classifier coupled with selective web counts.
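The idea of training on unambiguous conjugations can be illustrated with a minimal sketch. It is not the authors' implementation: the toy sentences, the `UNAMBIGUOUS` lookup table, the function names, and the add-one smoothing are all assumptions made for illustration. The example uses the ambiguous form 걸어, which can come from 걷다 (to walk) or 걸다 (to hang, to place a call), and harvests labels from forms such as 걷고 and 걸고 that belong to only one base form.

```python
import math
from collections import Counter, defaultdict

# Hypothetical lookup: inflected forms that map to exactly one base form.
UNAMBIGUOUS = {"걷고": "걷다", "걷는": "걷다", "걸고": "걸다", "거는": "걸다"}

# Invented, pre-tokenized raw text standing in for a large unlabeled corpus.
RAW = [
    ["길을", "걷고", "있다"],
    ["공원을", "걷는", "사람"],
    ["전화를", "걸고", "있다"],
    ["옷을", "거는", "사람"],
]

def harvest(sentences, unambiguous):
    """Label each unambiguous verb occurrence with its only base form."""
    examples = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in unambiguous:
                context = tokens[:i] + tokens[i + 1:]
                examples.append((context, unambiguous[tok]))
    return examples

def train(examples):
    """Collect the counts needed by a bag-of-words Naive Bayes model."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context, label in examples:
        label_counts[label] += 1
        for word in context:
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def posterior(context, model):
    """Return P(base form | context) with add-one smoothing."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in context:
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

model = train(harvest(RAW, UNAMBIGUOUS))
# Context of the ambiguous token in the invented sentence "전화를 걸어 보았다".
probs = posterior(["전화를", "보았다"], model)
prediction = max(probs, key=probs.get)
```

No manual annotation enters this pipeline: every training label comes from a form that admits only one analysis, which is the property the abstract relies on.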


► The determination of correct base forms for Korean VVMA words is recognized as a difficult problem.
► We label ambiguous verbs using knowledge induced from unambiguous instances of the same class.
► Methods for automatic data generation and self-training from unlabeled data are proposed.
► A strategy is suggested to extend the context information incrementally with web counts.
► The tagging accuracy was improved by automatic data generation coupled with selective web counts.
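The selective use of web counts can be sketched as a back-off rule. The source does not give the selection criterion or the way counts are combined, so everything below is an assumption: the thresholds `min_margin` and `min_known`, the `surface_forms` table, and the `web_count` callable (a hypothetical function from a query string to a hit count, stubbed here with a dictionary) are illustrative only.

```python
def needs_web_counts(posterior, context, vocab, min_margin=0.2, min_known=0.5):
    """Flag examples that are hard to predict or unlike the training data."""
    ranked = sorted(posterior.values(), reverse=True)
    margin = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
    known = sum(w in vocab for w in context) / len(context) if context else 0.0
    return margin < min_margin or known < min_known

def resolve(posterior, context, vocab, surface_forms, web_count,
            min_margin=0.2, min_known=0.5):
    """Keep the classifier's answer unless the example is flagged."""
    best = max(posterior, key=posterior.get)
    if not needs_web_counts(posterior, context, vocab, min_margin, min_known):
        return best  # confident case: no web query is issued
    # Count co-occurrences of each context word with an unambiguous
    # surface form of every candidate base form.
    totals = {
        base: sum(web_count(f"{word} {form}") for word in context)
        for base, form in surface_forms.items()
    }
    if not any(totals.values()):
        return best  # no evidence from counts, fall back to the classifier
    return max(totals, key=totals.get)

# Stubbed counts standing in for a real web query service (invented numbers).
FAKE_COUNTS = {"시동을 걸고": 120, "시동을 걷고": 0}

def fake_web_count(query):
    return FAKE_COUNTS.get(query, 0)

SURFACE_FORMS = {"걷다": "걷고", "걸다": "걸고"}
VOCAB = {"길을", "전화를"}

# Uncertain posterior and an unseen context word trigger the web lookup.
hard = resolve({"걷다": 0.5, "걸다": 0.5}, ["시동을"], VOCAB,
               SURFACE_FORMS, fake_web_count)
# A confident posterior with a known context word skips the lookup.
easy = resolve({"걷다": 0.9, "걸다": 0.1}, ["길을"], VOCAB,
               SURFACE_FORMS, fake_web_count)
```

Restricting queries to flagged examples keeps the number of web lookups small, which matches the "selective" aspect described above.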

Publisher
Database: Elsevier - ScienceDirect
Journal: Pattern Recognition Letters - Volume 33, Issue 1, 1 January 2012, Pages 62–70