Hap-seqX: Expedite algorithm for haplotype phasing with imputation using sequence data

Article ID	Journal	Published Year	Pages	File Type
5906624	Gene	2013	5 Pages	PDF

Abstract

Haplotype phasing is one of the most important problems in population genetics, as haplotypes can be used to estimate the relatedness of individuals and to impute genotype information, a commonly performed analysis when searching for variants involved in disease. The problem of haplotype phasing has been well studied. Methodologies for haplotype inference from sequencing data either combine a set of reference haplotypes and collected genotypes using a Hidden Markov Model (HMM) or assemble haplotypes by overlapping sequencing reads. A recent algorithm, Hap-seq, uses both sequencing data and reference haplotypes; it is a hybrid of a dynamic programming algorithm and an HMM and has been shown to be optimal. However, the algorithm requires an extremely large amount of memory, which is not practical for whole-genome datasets. The current algorithm saves intermediate results to disk and reads them back when needed, which significantly affects its practicality. In this work, we propose Hap-seqX, an expedited version of the algorithm that addresses the memory issue by using a posterior probability to select the records that should be saved in memory. We show that Hap-seqX can hold all of its intermediate results in memory and dramatically improves the execution time of the algorithm. Using this strategy, Hap-seqX is able to predict haplotypes from whole-genome sequencing data.
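The abstract does not spell out how the posterior probability is used to choose which records stay in memory, so the following is only a minimal sketch of one common way to do this kind of selection: normalize the candidate scores into a posterior and keep the highest-ranked records until they cover most of the probability mass. The function name `prune_by_posterior`, the `keep_mass` parameter, and the dictionary layout are assumptions made for illustration and are not taken from Hap-seqX.

```python
import math


def prune_by_posterior(scores, keep_mass=0.99):
    """Keep the top-scoring records that together cover `keep_mass` of the posterior.

    scores: dict mapping a record key (hypothetical, e.g. a haplotype-pair
            label) to an unnormalized log-likelihood.
    Returns a dict with the retained keys and their original log-likelihoods.
    """
    # Normalize with the log-sum-exp trick to obtain a posterior over records.
    peak = max(scores.values())
    weights = {key: math.exp(value - peak) for key, value in scores.items()}
    total = sum(weights.values())
    posterior = {key: weight / total for key, weight in weights.items()}

    # Walk records from most to least probable, stopping once enough
    # posterior mass has been retained (assumed selection rule).
    kept, mass = {}, 0.0
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    for key, probability in ranked:
        kept[key] = scores[key]
        mass += probability
        if mass >= keep_mass:
            break
    return kept


if __name__ == "__main__":
    candidates = {
        "a": math.log(0.70),
        "b": math.log(0.25),
        "c": math.log(0.04),
        "d": math.log(0.01),
    }
    print(sorted(prune_by_posterior(candidates, keep_mass=0.9)))
```

A mass-based cutoff such as this one trades a small, controllable amount of discarded probability for a bounded number of stored records; a fixed per-record threshold would be an equally plausible alternative.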

► We studied the problem of haplotype phasing.
► Our method uses both sequencing data and reference haplotypes.
► Hap-seqX is a hybrid method of dynamic programming and HMM.
► We used posterior probability to improve the memory efficiency.
► Hap-seqX is much more memory efficient than Hap-seq.

Keywords

MEC, Minimum Error Correction; MIR; IBD, identity-by-descent; HMM, Hidden Markov model; MCMC, Markov chain Monte Carlo; SNP, Single nucleotide polymorphism; Linkage disequilibrium; Imputation