Benchmarking performance for migrating a relational application to a parallel implementation

Article ID	Journal	Published Year	Pages	File Type
424515	Future Generation Computer Systems	2016	9 Pages	PDF

Abstract

Many organizations rely on relational database platforms for OLAP-style querying (aggregation and filtering) for small to medium size applications. We investigate the impact of scaling up the data sizes for such queries. We intend to illustrate what kind of performance results an organization could expect should they migrate current applications to big data environments. This paper benchmarks the performance of Hive (Thusoo et al., 2009) [9], a parallel data warehouse platform that is a part of the Hadoop software stack. We set up a 4-node Hadoop cluster using Hortonworks HDP 1.3.2 (Hortonworks HDP 1.3.2). We use the data generator provided by the TPC-DS benchmark (DSGen v1.1.0) to generate data of different scales. We compare the performance of loading data and querying for SQL and Hive Query Language (HiveQL) on a relational database installation (MySQL) and on a Hive cluster, respectively. We measure the speedup for query execution for three dataset sizes resulting from the scale up. Hive loads the large datasets faster than MySQL, while it is marginally slower than MySQL when loading the smaller datasets. Query execution in Hive is also faster. We also investigate executing Hive queries concurrently in workloads and conclude that serial execution of queries is a much better practice for clusters with limited resources.

Keywords

SQL Benchmarking Hadoop Queries Big Data Hive