Article code: 416807
Journal code: 681403
Publication year: 2006
English article: 20 pages, PDF
Full-text version: Free download
English title of the ISI article
Extending fuzzy and probabilistic clustering to very large data sets
Related topics
Engineering and Basic Sciences, Computer Engineering, Computational Theory and Mathematics
English abstract

Approximating clusters in very large (VL = unloadable) data sets has been considered from many angles. The proposed approach has three basic steps: (i) progressive sampling of the VL data, terminated when a sample passes a statistical goodness-of-fit test; (ii) clustering the sample with a literal (or exact) algorithm; and (iii) non-iterative extension of the literal clusters to the remainder of the data set. Extension accelerates clustering on all (loadable) data sets. More importantly, extension provides feasibility, that is, a way to find (approximate) clusters, for data sets that are too large to be loaded into the primary memory of a single computer. A good generalized sampling and extension scheme should be effective for acceleration and feasibility using any extensible clustering algorithm. A general method for progressive sampling in VL sets of feature vectors is developed, and examples are given that show how to extend the literal fuzzy (c-means) and probabilistic (expectation-maximization) clustering algorithms onto VL data. The fuzzy extension is called the generalized extensible fast fuzzy c-means (geFFCM) algorithm and is illustrated using several experiments with mixtures of five-dimensional normal distributions.
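
Steps (ii) and (iii) of the abstract can be illustrated with a minimal sketch. This is not the geFFCM algorithm from the paper. It assumes a fixed-size random sample in place of the progressive sampling and goodness-of-fit test of step (i), uses the standard fuzzy c-means update equations with fuzzifier `m`, and the function names (`memberships`, `fuzzy_c_means`, `extend`) are hypothetical names chosen for illustration.

```python
import random


def sq_dist(x, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def memberships(x, centers, m=2.0):
    """Standard fuzzy c-means memberships of one point, given fixed centers."""
    d = [sq_dist(x, v) for v in centers]
    if any(di == 0.0 for di in d):
        # A point lying exactly on a center belongs fully to it
        # (shared equally if several centers coincide).
        hits = [1.0 if di == 0.0 else 0.0 for di in d]
        return [h / sum(hits) for h in hits]
    inv = [di ** (-1.0 / (m - 1.0)) for di in d]
    total = sum(inv)
    return [w / total for w in inv]


def fuzzy_c_means(sample, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Step (ii): literal fuzzy c-means on the (loadable) sample only."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(sample, c)]
    dim = len(sample[0])
    for _ in range(iters):
        U = [memberships(x, centers, m) for x in sample]
        new_centers = []
        for i in range(c):
            w = [u[i] ** m for u in U]
            total = sum(w)
            new_centers.append(
                [sum(wk * x[j] for wk, x in zip(w, sample)) / total
                 for j in range(dim)]
            )
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


def extend(data_iter, centers, m=2.0):
    """Step (iii): non-iterative extension, one pass over the remaining data.

    Only one point is held in memory at a time, so data_iter may stream
    from disk when the full data set is unloadable.
    """
    for x in data_iter:
        yield memberships(x, centers, m)
```

In this sketch the iterative work touches only the sample, while every other point is visited once in `extend`. That single pass is the source of both the acceleration and the feasibility described in the abstract.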

Publisher
Database: Elsevier - ScienceDirect
Journal: Computational Statistics & Data Analysis - Volume 51, Issue 1, 1 November 2006, Pages 215–234