CS-SCORE: Rapid identification and removal of human genome contaminants from metagenomic datasets

Article ID	Journal	Published Year	Pages	File Type
2820630	Genomics	2015	6 Pages	PDF

Abstract

• Rapid identification of host sequences contaminating metagenomic datasets
• Low memory footprint for handling datasets of any size
• Heuristic pre-filtering mechanism based on sequence compositional signatures
• Directed-mapping approach using a novel compositional metric (cs-score)

Metagenomic sequencing data obtained from host-associated microbial communities are usually contaminated with fragments of the host genome. These contaminating fragments must be identified and removed before any downstream analysis. Available host-contamination detection techniques demand substantial time and memory, which makes processing large metagenomic datasets challenging. This study presents CS-SCORE, a novel algorithm that rapidly identifies host sequences contaminating metagenomic datasets. Validation results indicate that CS-SCORE is 2–6 times faster than the current state-of-the-art methods. Furthermore, the memory footprint of CS-SCORE is in the range of 2–2.5 GB, which is significantly lower than that of other available tools. CS-SCORE achieves this efficiency by incorporating (1) a heuristic pre-filtering mechanism and (2) a directed-mapping approach that utilizes a novel sequence composition metric (cs-score). CS-SCORE is expected to be a handy ‘pre-processing’ utility for researchers analyzing metagenomic datasets.

Availability

For academic users, an implementation of CS-SCORE is freely available at: http://metagenomics.atc.tcs.com/cs-score (or) https://metagenomics.atc.tcs.com/preprocessing/cs-score.
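The abstract names a composition-based pre-filter but does not define the cs-score metric or the filtering heuristic. The following is therefore only a minimal sketch of the general idea of filtering reads by compositional signature, not the CS-SCORE algorithm itself. It assumes tetranucleotide frequencies as the signature and plain Euclidean distance as a stand-in for the metric; the function names, the choice of k = 4, and the threshold are all hypothetical.

```python
from collections import Counter
from itertools import product
import math


def composition_vector(seq, k=4):
    """Return the normalized k-mer frequency vector of a sequence.

    Assumption: k = 4 (tetranucleotides). Windows containing characters
    other than A, C, G, T are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def composition_distance(u, v):
    """Euclidean distance between two composition vectors.

    This is a stand-in metric, NOT the cs-score defined in the paper.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def prefilter(reads, host_profile, threshold):
    """Split reads into (host_candidates, others) by compositional distance.

    Only the host candidates would be passed on to a more expensive
    mapping step; the threshold is a hypothetical tuning parameter.
    """
    host_candidates, others = [], []
    for read in reads:
        d = composition_distance(composition_vector(read), host_profile)
        if d <= threshold:
            host_candidates.append(read)
        else:
            others.append(read)
    return host_candidates, others
```

For example, `prefilter(reads, composition_vector(host_fragment), 0.3)` would keep only the reads whose tetranucleotide profile lies within 0.3 of the host fragment's profile. A cheap composition check of this kind narrows the set of reads sent to alignment, which is one plausible way a pre-filter can reduce both run time and memory use.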

Keywords

DNA contamination; Cluster analysis; Metagenomics