Practical Identification of Dynamic Precedence Criteria to Produce Critical Results from Big Data Streams

Article ID	Journal	Published Year	Pages	File Type
10327326	Big Data Research	2015	18 Pages	PDF

Abstract

During periods of high volume, big data stream applications may not have enough resources to process all incoming tuples. To maximize the production of the most critical results under such resource shortages, a recent solution, PR (short for Preferential Result), uses both static criteria (defined at compile-time) and dynamic criteria (identified online at run-time) to prioritize the processing of tuples throughout the query pipeline. Unfortunately, locating the optimal criteria placement (i.e., where in the query pipeline to evaluate each prioritization criterion) is extremely compute-intensive and takes exponential time, which makes PR impractical for complex big data stream systems. Our proposed criteria selection and placement approach, PR-Prune (short for Preferential Result-Pruning), is practical. PR-Prune prunes ineffective dynamic criteria and combines multiple criteria along the same pipeline. To achieve this, PR-Prune seeks to extend the duration, within the query pipeline, for which tuples identified as critical are pulled forward. Our experiments use a real data stream from the S&P 500 stocks, synthetic data streams, and a diverse set of queries. The results substantiate that PR-Prune increases the production of the most critical results compared to the state-of-the-art approaches. In addition, PR-Prune significantly lowers the optimization search time compared to PR.
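
The general idea of prioritizing tuples under a resource shortage can be illustrated with a minimal sketch. It is not the PR or PR-Prune algorithm, whose criteria selection and placement are not detailed in this abstract. It assumes a single operator with a fixed processing budget, and every name in it (`process_by_priority`, `watched`, `price_change`, `threshold`) is hypothetical and chosen only for illustration.

```python
import heapq
from typing import Callable, Dict, Iterable, List

StreamTuple = Dict[str, float]
Criterion = Callable[[StreamTuple], bool]


def process_by_priority(
    tuples: Iterable[StreamTuple],
    criteria: List[Criterion],
    budget: int,
) -> List[StreamTuple]:
    """Return at most `budget` tuples, most critical first.

    A tuple's priority is the number of criteria it satisfies
    (an assumption of this sketch, not a rule from the paper).
    Ties are broken by arrival order.
    """
    heap = []
    for order, t in enumerate(tuples):
        score = sum(1 for criterion in criteria if criterion(t))
        # Negate the score because heapq is a min-heap; `order` is unique,
        # so the dictionaries themselves are never compared.
        heapq.heappush(heap, (-score, order, t))

    processed = []
    while heap and len(processed) < budget:
        _, _, t = heapq.heappop(heap)
        processed.append(t)
    return processed


# Hypothetical stock-like tuples.
stream = [
    {"id": 1, "price_change": 0.1, "watched": 0},
    {"id": 2, "price_change": 5.0, "watched": 1},
    {"id": 3, "price_change": 4.0, "watched": 0},
    {"id": 4, "price_change": 0.2, "watched": 1},
]

# Static criterion: fixed before the query runs.
def is_watched(t: StreamTuple) -> bool:
    return t["watched"] == 1

# Dynamic criterion: the threshold stands in for a value identified at run-time.
threshold = 3.0

def is_volatile(t: StreamTuple) -> bool:
    return abs(t["price_change"]) > threshold

result = process_by_priority(stream, [is_watched, is_volatile], budget=2)
print([t["id"] for t in result])
```

With a budget of two tuples, the sketch processes the tuple that meets both criteria first, then the earliest-arriving tuple that meets one. Deciding which dynamic criteria are worth evaluating, and at which point in a multi-operator pipeline, is the optimization problem the abstract describes and is outside the scope of this sketch.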