Article code: 10321864
Journal code: 660771
Year of publication: 2015
English article: 10 pages, PDF
Full-text version: free download
English title of the ISI article
Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering
Persian translation of the title
کاهش ابعاد ترکیبی با یکپارچه سازی ویژگی انتخاب با روش استخراج ویژگی برای خوشه بندی متن
Keywords
Text clustering, feature selection, feature extraction, term variance, document frequency, principal component analysis
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
English abstract
High dimensionality of the feature space is a major concern in text clustering because it affects both computational complexity and accuracy. Various dimension reduction methods have therefore been introduced in the literature to select an informative subset (or sublist) of features. Because each method uses a different strategy (aspect) to select features, different methods produce different feature sublists for the same dataset. Hence, hybrid approaches, which combine different aspects of feature relevance for feature subset selection, have received considerable attention. Traditionally, union or intersection is used to merge feature sublists selected by different methods. The union approach selects all features from the considered sublists, which increases the total number of features; the intersection approach selects only the features common to them, which loses some important features. Therefore, to take advantage of one method while lessening the drawbacks of the other, a novel integration approach, namely modified union, is proposed. This approach applies union to the selected top-ranked features and intersection to the remaining feature sublists. It thus ensures selection of top-ranked as well as common features without increasing the dimensionality of the feature space much. In this study, the feature selection methods term variance (TV) and document frequency (DF) are used to compute the features' relevance scores. Next, a feature extraction method, principal component analysis (PCA), is applied to further reduce the dimensionality of the feature space without losing much information. The effectiveness of the proposed method is tested on three benchmark datasets, namely Reuters-21578, Classic4, and WebKB. The obtained results are compared with TV, DF, and variants of the proposed hybrid dimension reduction method. The experimental studies clearly demonstrate that our proposed method improves clustering accuracy compared to the competing methods.
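The merging step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas or cut-offs, so the term variance definition used here (sum of squared deviations from a term's mean frequency), the tie-breaking rule, and the parameters `top_k` and `sublist_size` are all assumptions made for illustration.

```python
def term_variance(doc_term):
    """Score each term by the sum of squared deviations of its frequency
    from its mean frequency over all documents (assumed TV definition)."""
    n_docs = len(doc_term)
    scores = []
    for j in range(len(doc_term[0])):
        column = [row[j] for row in doc_term]
        mean = sum(column) / n_docs
        scores.append(sum((f - mean) ** 2 for f in column))
    return scores


def document_frequency(doc_term):
    """Score each term by the number of documents it occurs in."""
    n_terms = len(doc_term[0])
    return [sum(1 for row in doc_term if row[j] > 0) for j in range(n_terms)]


def ranked(scores):
    """Term indices sorted by descending score; ties broken by index
    (the tie-breaking rule is an assumption)."""
    return sorted(range(len(scores)), key=lambda j: (-scores[j], j))


def modified_union(rank_a, rank_b, top_k, sublist_size):
    """Union of the top_k features of each ranking, plus the intersection
    of the remaining features of each sublist (ranks top_k..sublist_size).
    top_k and sublist_size are hypothetical cut-offs."""
    top = set(rank_a[:top_k]) | set(rank_b[:top_k])
    rest = set(rank_a[top_k:sublist_size]) & set(rank_b[top_k:sublist_size])
    return sorted(top | rest)
```

Given two rankings produced by `ranked(term_variance(X))` and `ranked(document_frequency(X))` for a document-term matrix `X`, `modified_union` keeps every top-ranked feature from either method but admits a lower-ranked feature only when both methods selected it. The PCA step that the abstract applies afterwards is left out of this sketch.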
Publisher
Database: Elsevier - ScienceDirect
Journal: Expert Systems with Applications - Volume 42, Issue 6, 15 April 2015, Pages 3105-3114