Article code: 389859
Journal code: 661185
Publication year: 2015
English article: 17 pages, PDF
Full-text version: Free download
English title of the ISI article
Parallel sampling from big data with uncertainty distribution
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
English abstract

Data are inherently uncertain in most applications. Uncertainty arises when an experiment such as sampling is about to be carried out: its result is not known in advance, and a variety of outcomes are possible. With the rapid development of data collection and distributed storage technologies, big data have become a bigger-than-ever problem, and dealing with big data with uncertainty distribution is one of the most important issues in big data research. In this paper, we propose a Parallel Sampling method based on Hyper Surface for big data with uncertainty distribution, namely PSHS, which adopts a universal concept of the Minimal Consistent Subset (MCS) of Hyper Surface Classification (HSC). Our approach to handling uncertainty in sampling from big data is motivated by three observations: (1) the inherent structure of the original sample set is uncertain to us, (2) the boundary set formed by all the possible separating hyper surfaces is a fuzzy set, and (3) there is uncertainty in the elements of the MCS. PSHS is implemented on the MapReduce framework, a current and powerful parallel programming technique used in many fields. Experiments have been carried out on several data sets, including real-world data from the UCI repository and synthetic data. The results show that our algorithm shrinks data sets while maintaining an identical distribution, which is useful for obtaining the inherent structure of the data sets. Furthermore, the evaluation criteria of speedup, scaleup and sizeup validate its efficiency.
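
The abstract names speedup, scaleup and sizeup without defining them. The sketch below uses the formulation commonly found in the parallel data mining literature; it is an assumption that the paper uses the same definitions, and the function names and timings are hypothetical, not results from the paper.

```python
# Common definitions of the three parallel-efficiency criteria.
# Assumption: these match the paper's usage; the abstract does not say.

def speedup(time_one_node: float, time_m_nodes: float) -> float:
    """Same data set, 1 node vs. m nodes. Ideal value is m."""
    return time_one_node / time_m_nodes

def scaleup(time_base: float, time_scaled: float) -> float:
    """Data D on 1 node vs. data m*D on m nodes. Ideal value is 1."""
    return time_base / time_scaled

def sizeup(time_base_data: float, time_m_times_data: float) -> float:
    """Fixed number of nodes, data D vs. data m*D. Linear growth gives m."""
    return time_m_times_data / time_base_data

# Hypothetical wall-clock timings in seconds, for illustration only.
print(speedup(120.0, 40.0))   # 4 nodes assumed; 3.0 means sub-linear speedup
print(scaleup(100.0, 125.0))  # 0.8, slightly below the ideal of 1
print(sizeup(50.0, 200.0))    # 4.0, linear growth for 4x the data
```

A speedup close to the number of nodes, a scaleup close to 1, and a sizeup that grows no faster than the data size are the usual signs that a MapReduce job parallelizes well.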

Publisher
Database: Elsevier - ScienceDirect
Journal: Fuzzy Sets and Systems - Volume 258, 1 January 2015, Pages 117–133