Article ID Journal Published Year Pages File Type
383082 Expert Systems with Applications 2014 9 Pages PDF
Abstract

•It analyzes the effects of insert and delete operations in the traditional SB algorithm.
•It proposes the concept of matching-failed segments caused by these operations.
•It proposes an efficient sliding blocking algorithm with backtracking sub-blocks (SBBS).
•SBBS detects as much duplicate data as possible within matching-failed segments.

With the explosive growth of data, storage systems face enormous pressure from redundant data caused by duplicate copies or regions of files. Data deduplication is a storage-optimization technique that reduces the data footprint by eliminating multiple copies of redundant data and storing only unique data. The basis of data deduplication is duplicate data detection, which divides files into a number of parts, compares corresponding parts between files via hash techniques, and identifies the redundant data. This paper proposes an efficient sliding blocking algorithm with backtracking sub-blocks, called SBBS, for duplicate data detection. SBBS improves the detection precision of the traditional sliding blocking (SB) algorithm by backtracking over the left/right 1/4 and 1/2 sub-blocks of matching-failed segments. Experimental results show that SBBS improves duplicate detection precision by an average of 6.5% over the traditional SB algorithm and 16.5% over the content-defined chunking (CDC) algorithm, while incurring little extra storage overhead when files are divided into fixed-size 8 kB chunks.
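The backtracking idea described in the abstract can be illustrated with a minimal sketch. The following toy Python example is an assumption-laden simplification, not the authors' implementation: it uses an 8-byte block instead of the paper's 8 kB chunks, hashes each stored block together with its left/right 1/2 and 1/4 sub-blocks into an index, and, when a full-block match fails, falls back to checking those sub-block hashes so that partial duplicates inside a matching-failed segment are still counted. The real SB algorithm additionally slides the window byte by byte on failure, which is omitted here for brevity.

```python
import hashlib

BLOCK = 8                          # toy block size; the paper evaluates 8 kB chunks
SUBS = (BLOCK // 2, BLOCK // 4)    # 1/2 and 1/4 sub-block sizes to backtrack

def h(data: bytes) -> str:
    """Fingerprint a chunk; hash-based fingerprints are standard in deduplication."""
    return hashlib.sha1(data).hexdigest()

def build_index(reference: bytes) -> set:
    """Index full-block hashes of stored data, plus left/right sub-block hashes."""
    index = set()
    for start in range(0, len(reference) - BLOCK + 1, BLOCK):
        block = reference[start:start + BLOCK]
        index.add(h(block))
        for size in SUBS:
            index.add(h(block[:size]))    # left sub-block
            index.add(h(block[-size:]))   # right sub-block
    return index

def detect_duplicates(new: bytes, index: set) -> int:
    """Count duplicate bytes. On a failed full-block match, backtrack and
    try the left/right 1/2 and 1/4 sub-blocks of the failed segment."""
    dup = 0
    pos = 0
    while pos + BLOCK <= len(new):
        block = new[pos:pos + BLOCK]
        if h(block) in index:             # full block matched
            dup += BLOCK
        else:                             # matching-failed segment: backtrack
            found = 0
            for size in SUBS:
                if h(block[:size]) in index:
                    found = max(found, size)
                if h(block[-size:]) in index:
                    found = max(found, size)
            dup += found                  # credit the largest matched sub-block
        pos += BLOCK
    return dup

ref = b"AAAAAAAABBBBBBBBCCCCCCCC"        # three stored 8-byte blocks
idx = build_index(ref)
new = b"AAAAAAAAXXXXBBBBCCCCCCCC"        # middle block half-overwritten
print(detect_duplicates(new, idx))       # plain SB would miss the "BBBB" half
```

In this example the middle block fails the full match, but its right 1/2 sub-block still matches the stored data, so SBBS-style backtracking recovers 4 more duplicate bytes than a plain full-block comparison would.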

Related Topics
Physical Sciences and Engineering Computer Science Artificial Intelligence