Optimizing distributed data stream processing by tracing

Article ID	Journal	Published Year	Pages	File Type
11002412	Future Generation Computer Systems	2019	21 Pages	PDF

Abstract

Using Apache Spark as an illustration, we show how our distributed tracing engine can help mitigate a range of efficiency issues in data stream processing. We describe and qualitatively compare two designs: one based on reporting to a distributed database and another based on trace piggybacking. Our prototype implementation consists of wrappers suitable for JVM environments in general, with minimal impact on the source code of the core system. Our tracing framework is the first to solve tracing across the boundaries of multiple systems and to provide detailed performance measurements suitable for automated optimization, not just debugging.
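The two designs compared in the abstract can be illustrated with a small, language-neutral sketch. This is not the authors' JVM prototype: the names Traced, piggyback and reporting are hypothetical, operators are plain functions rather than Spark transformations, and an in-memory list stands in for the distributed database. It only shows the difference between carrying the trace with each record and reporting trace events to a separate store.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

# A trace event: (operator name, elapsed seconds). Hypothetical format.
Event = Tuple[str, float]


@dataclass
class Traced:
    """A record plus the trace it has accumulated so far (piggybacking)."""
    value: Any
    trace: List[Event] = field(default_factory=list)


def piggyback(name: str, fn: Callable[[Any], Any]) -> Callable[[Traced], Traced]:
    """Wrap an operator so its timing travels with the record itself."""
    def wrapper(rec: Traced) -> Traced:
        start = time.perf_counter()
        out = fn(rec.value)
        elapsed = time.perf_counter() - start
        return Traced(out, rec.trace + [(name, elapsed)])
    return wrapper


def reporting(name: str, fn: Callable[[Any], Any],
              store: List[Event]) -> Callable[[Any], Any]:
    """Wrap an operator so its timing is reported to an external store.

    The list `store` is an assumption standing in for a distributed database.
    """
    def wrapper(value: Any) -> Any:
        start = time.perf_counter()
        out = fn(value)
        store.append((name, time.perf_counter() - start))
        return out
    return wrapper


# Design 1: the trace is piggybacked on the record through the pipeline.
parse = piggyback("parse", int)
double = piggyback("double", lambda x: x * 2)
result = double(parse(Traced("21")))
print(result.value)                         # 42
print([name for name, _ in result.trace])   # ['parse', 'double']

# Design 2: records stay unchanged; events go to a separate store.
store: List[Event] = []
parse_r = reporting("parse", int, store)
double_r = reporting("double", lambda x: x * 2, store)
plain = double_r(parse_r("21"))
print(plain)                                # 42
print([name for name, _ in store])          # ['parse', 'double']
```

In both variants the operators themselves are untouched and only wrapped, which mirrors the abstract's point about minimal impact on the core system's source code. Piggybacking keeps each record's history local at the cost of larger records, while reporting keeps records small but requires a store that can later correlate events.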

Keywords

Apache Spark; Distributed data processing; Data stream processing; Data provenance