Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
484842 | 703295 | 2015 | 8-page PDF | Free download |

Various appealing ideas have recently been proposed in the statistical literature to scale up machine learning techniques and solve predictive/inferential problems on "Big Datasets". Beyond the massively parallelized and distributed approaches exploiting hardware architectures and programming frameworks, which have received increasing interest in recent years, several variants of the Stochastic Gradient Descent (SGD) method based on "smart" sampling procedures have been designed to accelerate the model-fitting stage. Such techniques exploit either the form of the objective functional or some supposedly available auxiliary information, and have been thoroughly investigated from a theoretical viewpoint. Though attractive, these statistical methods must also be analyzed from a computational perspective, bearing in mind the options offered by recent technological advances. It is thus of vital importance to investigate how to implement these inferential principles efficiently in order to achieve the best trade-off between computational time and accuracy. In this paper, we explore the scalability of the SGD techniques introduced in [9,11] from an experimental perspective. Issues related to their implementation on distributed computing platforms such as Apache Spark are also discussed, and experimental results based on large-scale real datasets are presented to illustrate the relevance of the promoted approaches.
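To make the idea of sampling-based SGD variants concrete, the sketch below shows a generic SGD loop in which training examples are drawn according to user-supplied sampling weights rather than uniformly, with the drawn gradient rescaled by the inverse of its sampling probability so the update remains unbiased. This is a minimal illustration of the general principle, not the specific procedures of [9,11]; the function name `sgd_weighted` and the toy 1-D least-squares problem are assumptions introduced for this example.

```python
import random

def sgd_weighted(data, grad, w0, steps=400, probs=None, seed=0):
    """SGD where each step samples an example index i according to probs,
    then applies the gradient rescaled by 1/(n * probs[i]) so that the
    stochastic gradient stays an unbiased estimate of the full gradient."""
    rng = random.Random(seed)
    n = len(data)
    if probs is None:
        probs = [1.0 / n] * n  # uniform sampling by default
    # build the cumulative distribution once for inverse-CDF sampling
    cdf, acc = [], 0.0
    for p in probs:
        acc += p
        cdf.append(acc)
    w = w0
    for t in range(steps):
        u = rng.random() * cdf[-1]
        i = next(k for k, c in enumerate(cdf) if c >= u)
        step = 1.0 / (2.0 * (t + 1))  # Robbins-Monro decaying step size
        w -= step * grad(w, data[i]) / (n * probs[i])
    return w

# toy 1-D least squares: minimize (1/n) * sum_i (w - x_i)^2,
# whose gradient at a single example x_i is 2 * (w - x_i)
xs = [1.0, 2.0, 3.0]
w_hat = sgd_weighted(xs, lambda w, x: 2.0 * (w - x), w0=0.0)
```

With uniform sampling the minimizer is the sample mean of `xs`, so `w_hat` should land close to 2.0; a "smart" sampling scheme would instead concentrate `probs` on informative examples while the `1/(n * probs[i])` factor preserves unbiasedness.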
Journal: Procedia Computer Science - Volume 53, 2015, Pages 308-315