کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
379126 659267 2008 14 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Deterministic algorithms for sampling count data
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر هوش مصنوعی
پیش نمایش صفحه اول مقاله
Deterministic algorithms for sampling count data
چکیده انگلیسی

Processing and extracting meaningful knowledge from count data is an important problem in data mining. The volume of data is increasing dramatically as the data is generated by day-to-day activities such as market basket data, web clickstream data or network data. Most mining and analysis algorithms require multiple passes over the data, which requires extreme amounts of time. One solution to save time would be to use samples, since sampling is a good surrogate for the data and the same sample can be used to answer many kinds of queries. In this paper, we propose two deterministic sampling algorithms, Biased-L2 and DRS. Both produce samples vastly superior to the previous deterministic and random algorithms, both in sample quality and accuracy. Our algorithms also improve on the run-time and memory footprint of the existing deterministic algorithms. The new algorithms can be used to sample from a relational database as well as data streams, with the ability to examine each transaction only once, and maintain the sample on-the-fly in a streaming fashion. We further show how to engineer one of our algorithms (DRS) to adapt and recover from changes to the underlying data distribution, or sample size. We evaluate our algorithms on three different synthetic datasets, as well as on real-world clickstream data, and demonstrate the improvements over previous art.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Data & Knowledge Engineering - Volume 64, Issue 2, February 2008, Pages 405–418
نویسندگان
, , ,