Article ID: 378964
Journal: Data & Knowledge Engineering
Published Year: 2010
Pages: 16
File Type: PDF
Abstract

User-Defined Functions (UDFs) are an extensibility mechanism provided by most DBMSs, and they execute in main memory. UDFs also leverage the multi-threaded capabilities of the DBMS and exploit the speed and flexibility of the C language for mathematical computations. In this article, we study how to accelerate the computation of sufficient statistics on large data sets with UDFs that exploit caching and sampling techniques. We present an aggregate UDF that computes multidimensional sufficient statistics benefiting a broad array of statistical models: the linear sum of points and the quadratic sum of cross-products of point dimensions. Caching can be applied when the data set fits in main memory. Otherwise, sampling is required to accelerate processing of very large data sets. Sampling can also be applied to data sets that can be cached, to further accelerate processing. Experiments carefully analyze performance and accuracy with real and synthetic data sets. We compare UDFs working inside the DBMS with C++ reading flat files, running on the same hardware. We show UDFs can have similar performance to C++, even when both exploit caching and multi-threading. As expected, C++ is much faster than UDFs when the data set is scanned from disk. We carefully analyze the case where sampling is required with larger data sets. We show that geometric and bootstrapping sampling techniques can be faster than performing full table scans, while providing highly accurate estimates of mean, variance and correlation. Furthermore, sampling on cached data sets can provide accurate answers in a few seconds. Detailed experiments illustrate UDF optimizations, including diagonal matrix computation, join avoidance and acceleration with a multi-core CPU, when available. A profile of UDF run-time execution shows that the UDF is slowed down by I/O when reading from disk.
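
To make the sufficient statistics named in the abstract concrete, the following is a minimal sketch of a one-pass accumulator for the point count n, the linear sum L, and the quadratic sum of cross-products Q, with mean, variance and correlation derived from them. It is an illustration only, not the article's UDF: the class and method names (`SufficientStats`, `add`, `merge`) are hypothetical, Python stands in for the C code a DBMS would run, the variance is assumed to be the population variance, and storing only the lower triangle of Q is a choice made here for brevity.

```python
from math import sqrt


class SufficientStats:
    """Hypothetical one-pass accumulator for n, L and Q over d-dimensional points."""

    def __init__(self, d):
        self.d = d
        self.n = 0                                   # number of points seen
        self.L = [0.0] * d                           # linear sum of points
        self.Q = [[0.0] * d for _ in range(d)]       # sum of cross-products

    def add(self, x):
        """Fold one point into the running sums (one step of a table scan)."""
        self.n += 1
        for a in range(self.d):
            self.L[a] += x[a]
            for b in range(a + 1):                   # lower triangle; Q is symmetric
                self.Q[a][b] += x[a] * x[b]

    def merge(self, other):
        """Combine partial results, e.g. from two threads or two partitions."""
        self.n += other.n
        for a in range(self.d):
            self.L[a] += other.L[a]
            for b in range(a + 1):
                self.Q[a][b] += other.Q[a][b]

    def mean(self):
        return [s / self.n for s in self.L]

    def variance(self):
        # Population variance per dimension: Q_aa / n - (L_a / n)^2
        return [self.Q[a][a] / self.n - (self.L[a] / self.n) ** 2
                for a in range(self.d)]

    def correlation(self, a, b):
        # Pearson correlation computed from n, L and Q only
        if a < b:
            a, b = b, a                              # read from the lower triangle
        n = self.n
        num = n * self.Q[a][b] - self.L[a] * self.L[b]
        den = (sqrt(n * self.Q[a][a] - self.L[a] ** 2)
               * sqrt(n * self.Q[b][b] - self.L[b] ** 2))
        return num / den


# Usage: scan the points once, then derive the statistics without rescanning.
stats = SufficientStats(2)
for point in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]:
    stats.add(point)
print(stats.mean(), stats.variance(), stats.correlation(0, 1))
```

Because n, L and Q are plain sums, partial accumulators can be merged in any order, which is what allows this kind of computation to be split across threads or applied to a sample instead of the full table. The difference-of-sums formulas above can lose precision when variances are small relative to the means, so a production version would need to take that into account.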
