Article code | Journal code | Publication year | English article | Full-text version
4967022 | 1449303 | 2017 | 10-page PDF | Free download
English title of the ISI article
Machine-learned cluster identification in high-dimensional data
Persian translation of the title
شناسایی خوشه ماشین شناسایی در داده های با ابعاد بزرگ
Related topics
Engineering and Basic Sciences > Computer Engineering > Computer Science Applications
English abstract


- High-dimensional biomedical data are commonly searched for structures.
- Common cluster algorithms may impose non-existent clusters or assign data to the wrong clusters.
- We highlight accepted proposals of using emergent self-organizing maps for clustering.
- The addition of the U-matrix provides a visually controllable representation of cluster structure.

Background
High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the clustering algorithm used works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM).

Methods
Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM, the distance structure in the high-dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by common classical clustering algorithms, including single linkage, Ward and k-means.

Results
Ward clustering imposed cluster structures on cluster-less “golf ball”, “cuboid” and “S-shaped” data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real-world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. Conversely, ESOM/U-matrix correctly identified clusters in biomedical data that truly contained subgroups. It also identified the cluster structure correctly in all further canonical artificial data sets. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high-dimensional biomedical data.

Conclusions
The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data.
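
The abstract's point that classical algorithms can suggest clusters in structure-free data can be illustrated with a minimal, dependency-free sketch. It is not the authors' experiment: the abstract attributes the reported failures to Ward clustering, whereas this sketch uses a plain Lloyd-style k-means (one of the comparison algorithms named in the Methods) because it is short to write without external libraries. The data generator, the cuboid dimensions, the seeds and the function name `kmeans` are all assumptions made for illustration. The sketch only shows that a partitioning algorithm returns k groups by construction, whether or not any group structure exists.

```python
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd-style k-means on a list of equal-length tuples.

    Illustrative helper (not from the paper): returns (labels, centroids).
    """
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster ever empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Structure-free data: points drawn uniformly inside a 3 x 2 x 1 cuboid
# (dimensions and sample size are arbitrary assumptions).
data_rng = random.Random(42)
cuboid = [
    (data_rng.uniform(0, 3), data_rng.uniform(0, 2), data_rng.uniform(0, 1))
    for _ in range(300)
]

labels, centroids = kmeans(cuboid, k=2)
sizes = [labels.count(j) for j in range(2)]
print("cluster sizes:", sizes)  # two non-empty groups despite no structure
```

The output partitions the cuboid into two groups even though the points were drawn from a single uniform distribution, which is why a result of this kind needs an independent check for whether any cluster structure is present at all.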

3-D representation of high-dimensional data following ESOM projection and visualization of group (cluster) structures using the U-matrix, which employs a geographical map analogy of valleys where members of the same cluster are located, separated by mountain ranges marking cluster borders.
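
The "valleys and mountain ranges" picture in the caption can be made concrete with a small sketch of how a U-matrix height is commonly computed: for each neuron on the map grid, take the average distance between its weight vector and the weight vectors of its grid neighbours. The sketch below is an assumption-laden illustration, not the tool used in the paper: it uses a flat (non-toroidal) rectangular grid, a 4-connected neighbourhood, and a hand-built toy weight grid in place of a trained ESOM. The names `u_matrix` and `toy_weights` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def u_matrix(weights):
    """Return the U-height of every neuron on a rectangular map grid.

    weights[r][c] is the weight vector of the neuron at row r, column c.
    The U-height is taken here as the mean distance to the 4-connected
    grid neighbours (an assumption; other neighbourhoods and toroidal
    grids are also in use).
    """
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(euclidean(weights[r][c], weights[nr][nc]))
            heights[r][c] = sum(dists) / len(dists) if dists else 0.0
    return heights


# Toy stand-in for a trained map: the left three columns hold one prototype,
# the right three columns another, so the border between them is a "ridge".
toy_weights = [
    [[0.0, 0.0] if c < 3 else [1.0, 1.0] for c in range(6)]
    for _ in range(4)
]
toy_heights = u_matrix(toy_weights)

for row in toy_heights:
    print(" ".join(f"{h:.2f}" for h in row))
```

In this toy grid the heights are zero inside each half (the valleys) and positive only in the two columns that face each other across the prototype boundary (the mountain range), which is the pattern the caption describes for cluster borders.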

Publisher
Database: Elsevier - ScienceDirect
Journal: Journal of Biomedical Informatics - Volume 66, February 2017, Pages 95-104
Authors
, ,