Article code: 529885
Journal code: 869719
Publication year: 2015
English article: 15-page PDF
Full-text version: free download
English title of the ISI article
Stratified feature sampling method for ensemble clustering of high dimensional data
Persian translation of the title
روش نمونه گیری ویژگی طبقه بندی برای خوشه بندی گروهی داده های با ابعاد بزرگ
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Vision and Pattern Recognition
English abstract


• A new component generation approach is proposed to produce ensemble components.
• Stratified sampling is used to generate subspace component data sets.
• The component data represent the characteristics of the original data well.
• The proposed method achieves consistent performance in ensemble clustering.
• The proposed method is easy to implement.

High dimensional data with thousands of features present a major challenge to current clustering algorithms. Sparsity, noise and correlation among features are common characteristics of such data, and clusters in such data often exist in different subspaces. Ensemble clustering is emerging as a prominent technique for improving the robustness, stability and accuracy of high dimensional data clustering. In this paper, we propose a stratified sampling method for generating subspace component data sets in ensemble clustering of high dimensional data. Instead of randomly sampling a subset of features for each component data set, we first cluster the features of the high dimensional data into a few feature groups called feature strata. Using stratified sampling, we then randomly sample some features from each feature stratum and merge the sampled features from the different strata to generate a component data set. In this way, the component data sets better represent the clustering structure of the original data set. In synthetic data analysis, component clustering by stratified sampling achieved higher average clustering accuracy than random sampling and random projection, without sacrificing clustering diversity. We carried out a series of experiments on eight real world data sets from the microarray, text and image domains to evaluate ensemble clustering using three subspace component data generation methods and four consensus functions. The experimental results consistently showed that the stratified sampling method produced the best ensemble clustering results on all data sets. Ensemble clustering with stratified sampling also outperformed three other ensemble clustering methods that generate component clusters from the entire space of the original data.
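
The component generation step described in the abstract (cluster the features into strata, sample from each stratum, merge the samples) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how features are clustered or how many features are drawn per stratum, so plain k-means over feature columns and an equal sampling fraction per stratum are assumed here. All function and parameter names (`cluster_features`, `stratified_feature_sample`, `make_components`, `fraction`) are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_features(columns, n_strata, n_iter=20, seed=0):
    """Group feature columns into strata with plain k-means (an assumed choice)."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(columns, n_strata)]
    labels = [0] * len(columns)
    for _ in range(n_iter):
        # Assign every feature (column) to its nearest center.
        for j, col in enumerate(columns):
            labels[j] = min(range(n_strata), key=lambda k: sq_dist(col, centers[k]))
        # Move each center to the mean of its member features.
        for k in range(n_strata):
            members = [columns[j] for j in range(len(columns)) if labels[j] == k]
            if members:  # keep the old center if a stratum is empty
                centers[k] = [sum(v) / len(members) for v in zip(*members)]
    return labels


def stratified_feature_sample(labels, fraction, rng):
    """Draw the same fraction of features from every stratum, then merge."""
    strata = {}
    for j, k in enumerate(labels):
        strata.setdefault(k, []).append(j)
    chosen = []
    for k in sorted(strata):
        members = strata[k]
        n_take = max(1, round(fraction * len(members)))  # at least one per stratum
        chosen.extend(rng.sample(members, n_take))
    return sorted(chosen)


def make_components(data, n_strata, n_components, fraction, seed=0):
    """Return the stratum labels and a list of (feature_indices, component_data)."""
    columns = [list(col) for col in zip(*data)]  # one vector per feature
    labels = cluster_features(columns, n_strata, seed=seed)
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        idx = stratified_feature_sample(labels, fraction, rng)
        components.append((idx, [[row[j] for j in idx] for row in data]))
    return labels, components


if __name__ == "__main__":
    gen = random.Random(1)
    toy = [[gen.random() for _ in range(12)] for _ in range(6)]  # 6 objects, 12 features
    labels, components = make_components(toy, n_strata=3, n_components=4, fraction=0.5)
    for idx, comp in components:
        print(len(comp), "objects,", len(idx), "sampled features:", idx)
```

Each component data set would then be clustered separately and the resulting partitions combined with a consensus function; that stage is left out because the abstract does not name the consensus functions used.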

Publisher
Database: Elsevier - ScienceDirect
Journal: Pattern Recognition - Volume 48, Issue 11, November 2015, Pages 3688–3702