Data-local Reduce Task Scheduling

Article ID	Journal	Published Year	Pages	File Type
488511	Procedia Computer Science	2016	8 Pages	PDF

Abstract

Inspired by the success of Apache Hadoop, this paper proposes a new reduce task scheduler. Hadoop is an open-source implementation of Google's MapReduce framework. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The runtime system takes care of the details: partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the necessary inter-device communication. In current versions of Hadoop, map tasks are scheduled with respect to the locality of their inputs in order to reduce network traffic and improve performance. Reduce tasks, on the other hand, are scheduled without considering data locality, which degrades performance at the requesting nodes. In this paper, we exploit the data locality that is inherent in reduce tasks. To do so, we schedule them on the nodes that result in the least amount of data-local traffic. Experimental results show an 11-80 percent decrease in the number of bytes shuffled in a Hadoop cluster.
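The scheduling idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm or Hadoop's API; it is a simple greedy heuristic written under stated assumptions: for each reduce partition we know how many bytes of map output sit on each node, each node has a fixed number of free reduce slots, and the cost of a placement is the number of bytes that would have to be fetched from other nodes. The names `shuffled_bytes`, `pick_node`, `schedule`, and the node labels are hypothetical.

```python
from typing import Dict, List


def shuffled_bytes(partition_bytes: Dict[str, int], node: str) -> int:
    """Bytes that must cross the network if the reducer runs on `node`.

    `partition_bytes` maps a node name to the bytes of map output for
    this partition stored on that node (an assumed input, not a Hadoop API).
    """
    return sum(size for host, size in partition_bytes.items() if host != node)


def pick_node(partition_bytes: Dict[str, int], candidates: List[str]) -> str:
    """Choose the candidate node with the fewest remote bytes to fetch.

    Ties are broken by node name so the result is deterministic.
    """
    return min(candidates, key=lambda n: (shuffled_bytes(partition_bytes, n), n))


def schedule(partitions: Dict[int, Dict[str, int]],
             slots: Dict[str, int]) -> Dict[int, str]:
    """Greedily assign each reduce partition to a node with a free slot."""
    free = dict(slots)  # copy so the caller's slot counts are not modified
    assignment: Dict[int, str] = {}
    for pid in sorted(partitions):
        candidates = [node for node, count in free.items() if count > 0]
        if not candidates:
            raise RuntimeError("no free reduce slots left")
        node = pick_node(partitions[pid], candidates)
        assignment[pid] = node
        free[node] -= 1
    return assignment


# Hypothetical example: two reduce partitions, three nodes, one slot each.
example_partitions = {
    0: {"node-a": 700, "node-b": 200, "node-c": 100},
    1: {"node-a": 500, "node-b": 450, "node-c": 50},
}
example_slots = {"node-a": 1, "node-b": 1, "node-c": 1}
example_assignment = schedule(example_partitions, example_slots)
```

In this example partition 0 goes to node-a (300 remote bytes) and partition 1, with node-a already occupied, goes to node-b (550 remote bytes). A locality-unaware placement such as running both reducers on node-c and node-b in the opposite order would fetch more data over the network, which is the effect the abstract describes.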

Keywords

MapReduce; Hadoop