Article ID Journal Published Year Pages File Type
6873072 Future Generation Computer Systems 2018 8 Pages PDF
Abstract
Clustered storage systems such as HDFS have become a popular platform for handling Big Data. They are susceptible to failures and must recover data whenever a failure is detected. However, existing recovery schemes in HDFS are passive, which affects the processing efficiency of MapReduce jobs and degrades the performance of Hadoop storage systems. To address this problem, this paper proposes a novel scheme called PP (Popularity-based Proactive Data Recovery) that significantly boosts the recovery efficiency of HDFS RAID systems. PP tracks popular data and, when a node fails in a Hadoop system, immediately recovers the missing popular data, so that the effect on, and variation in, the performance of MapReduce jobs is minimal. A lightweight prototype implementation of PP and extensive experiments demonstrate that, compared with Hadoop's default locally-first scheduling and degraded-first scheduling, PP significantly reduces both the recovery time and the execution time of MapReduce jobs.
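The abstract does not give PP's algorithm, so the following is only a minimal illustrative sketch of the general idea of popularity-ordered recovery, not the authors' implementation. The class `PopularityTracker`, its methods, and the use of a raw access count as the popularity measure are all assumptions made for illustration.

```python
from collections import Counter


class PopularityTracker:
    """Hypothetical helper: counts block reads and orders lost blocks for recovery."""

    def __init__(self):
        self.access_counts = Counter()  # block id -> number of reads observed
        self.block_locations = {}       # block id -> node currently storing it

    def place(self, block_id, node):
        """Record which node stores a block."""
        self.block_locations[block_id] = node

    def record_access(self, block_id):
        """Count one read of a block (assumed popularity signal)."""
        self.access_counts[block_id] += 1

    def recovery_order(self, failed_node):
        """Return the blocks lost on failed_node, most-accessed first."""
        lost = [b for b, n in self.block_locations.items() if n == failed_node]
        # Sort by descending access count; ties fall back to block id
        # so the order is deterministic.
        return sorted(lost, key=lambda b: (-self.access_counts[b], b))


tracker = PopularityTracker()
tracker.place("b1", "nodeA")
tracker.place("b2", "nodeA")
tracker.place("b3", "nodeB")
tracker.place("b4", "nodeA")

for block in ["b2", "b2", "b2", "b1", "b3"]:
    tracker.record_access(block)

order = tracker.recovery_order("nodeA")
print(order)
```

In this sketch, a recovery process would rebuild the blocks in the returned order, so the blocks that running jobs are most likely to read become available again first. How PP actually measures popularity and schedules reconstruction is described in the paper itself, not here.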
Related Topics
Physical Sciences and Engineering > Computer Science > Computational Theory and Mathematics