Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
415634 | 681218 | 2007 | 13-page PDF | Free download |

For the classification of very large data sets with a mixture model approach, a two-step strategy for estimating the mixture is proposed. In the first step, the data are scaled down using compression techniques. Data compression consists of clustering the single observations into a medium number of groups and representing each group by a prototype, i.e. a triple of sufficient statistics (mean vector, covariance matrix, number of observations compressed). In the second step, the mixture is estimated by applying an adapted EM algorithm (called sufficient EM) to the sufficient statistics of the compressed data. The estimated mixture allows the classification of observations according to their maximum posterior probability of component membership. The performance of sufficient EM in clustering a real data set from a web-usage mining application is compared to standard EM and the TwoStep clustering algorithm as implemented in SPSS. It turns out that the algorithmic efficiency of sufficient EM is much higher than that of standard EM. While the TwoStep algorithm is even faster, its results show a lack of stability.
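The second step lends itself to a compact illustration. Below is a minimal sketch of an EM iteration driven only by the groups' sufficient statistics (count, mean vector, covariance matrix), written in Python with NumPy/SciPy. The initialization scheme, the evaluation of component densities at the group means, and the small regularization term are assumptions made for illustration; the paper's exact sufficient EM updates may differ.

```python
# Minimal sketch of EM on compressed data ("sufficient EM" in spirit), assuming
# each group g is summarized by (n_g, mean_g, cov_g). Not the paper's exact updates.
import numpy as np
from scipy.stats import multivariate_normal

def sufficient_em(counts, means, covs, K, n_iter=50, seed=0):
    """counts: (G,), means: (G, d), covs: (G, d, d); returns (pi, mu, Sigma)."""
    rng = np.random.default_rng(seed)
    G, d = means.shape
    N = counts.sum()
    # Initialize component means from randomly chosen prototypes (an assumption).
    idx = rng.choice(G, size=K, replace=False)
    mu = means[idx].copy()
    Sigma = np.stack([np.cov(means.T) + 1e-6 * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities of each component for each prototype,
        # with densities evaluated at the group means (a common approximation).
        logp = np.stack([
            np.log(pi[k]) + multivariate_normal.logpdf(means, mu[k], Sigma[k])
            for k in range(K)
        ], axis=1)                                    # shape (G, K)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: weighted updates using the groups' sufficient statistics.
        w = resp * counts[:, None]                    # effective counts, (G, K)
        Nk = w.sum(axis=0)
        pi = Nk / N
        mu = (w.T @ means) / Nk[:, None]
        for k in range(K):
            diff = means - mu[k]                      # (G, d)
            scatter = (w[:, k, None, None] *
                       (covs + np.einsum('gi,gj->gij', diff, diff))).sum(axis=0)
            Sigma[k] = scatter / Nk[k] + 1e-6 * np.eye(d)

    return pi, mu, Sigma
```

After estimation, individual observations (or whole prototypes) could then be assigned to the component with the largest posterior probability, e.g. by taking the argmax over the responsibilities, matching the classification rule described in the abstract.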
Journal: Computational Statistics & Data Analysis - Volume 51, Issue 11, 15 July 2007, Pages 5416–5428