Article ID: 5906624 | Journal: Gene | Published Year: 2013 | Pages: 5 | File Type: PDF
Abstract

Haplotype phasing is one of the most important problems in population genetics: haplotypes can be used to estimate the relatedness of individuals and to impute genotype information, a commonly performed analysis when searching for variants involved in disease. The problem of haplotype phasing has been well studied. Methodologies for haplotype inference from sequencing data either combine a set of reference haplotypes and collected genotypes using a Hidden Markov Model (HMM) or assemble haplotypes by overlapping sequencing reads. A recent algorithm, Hap-seq, uses both sequencing data and reference haplotypes; it is a hybrid of a dynamic programming algorithm and an HMM, which is shown to be optimal. However, the algorithm requires an extremely large amount of memory, which is not practical for whole-genome datasets. The current algorithm must save intermediate results to disk and read them back when needed, which significantly limits its practicality. In this work, we propose an expedited version of the algorithm, Hap-seqX, which addresses the memory issue by using a posterior probability to select the records that should be kept in memory. We show that Hap-seqX can hold all the intermediate results in memory and dramatically improves the execution time of the algorithm. With this strategy, Hap-seqX is able to predict haplotypes from whole-genome sequencing data.

► We studied the problem of haplotype phasing.
► Our method uses both sequencing data and reference haplotypes.
► Hap-seqX is a hybrid method of dynamic programming and HMM.
► We used posterior probability to improve memory efficiency.
► Hap-seqX is much more memory efficient than Hap-seq.
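
The abstract says only that a posterior probability is used to select which intermediate records stay in memory; it does not give the selection rule. As a minimal sketch of one way such a filter could work, the Python below normalises the log-scores of the candidate states at one dynamic-programming step into posteriors and keeps the most probable states until a chosen fraction of the posterior mass is covered. The function name `prune_by_posterior`, the `keep_mass` threshold, and the representation of a state as a pair of haplotype strings are all assumptions made for illustration, not details taken from Hap-seqX.

```python
import math


def prune_by_posterior(log_scores, keep_mass=0.99):
    """Keep the most probable states covering at least `keep_mass` posterior mass.

    `log_scores` maps a candidate state (hypothetically, a pair of haplotype
    strings) to its unnormalised log-probability at one DP step.
    """
    if not log_scores:
        return {}

    # Subtract the maximum before exponentiating for numerical stability.
    max_log = max(log_scores.values())
    weights = {s: math.exp(v - max_log) for s, v in log_scores.items()}
    total = sum(weights.values())

    kept = {}
    mass = 0.0
    # Visit states from most to least probable.
    for state, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        kept[state] = log_scores[state]
        mass += w / total
        if mass >= keep_mass:
            break
    return kept


# Hypothetical scores for four candidate haplotype pairs at one step.
scores = {
    ("00", "11"): math.log(0.90),
    ("01", "10"): math.log(0.09),
    ("00", "10"): math.log(0.009),
    ("11", "11"): math.log(0.001),
}
kept = prune_by_posterior(scores, keep_mass=0.95)
print(sorted(kept))
```

With a threshold of this kind, the number of records held per step depends on how concentrated the posterior is rather than on the full size of the state space, which is the general idea behind trading a small amount of probability mass for memory.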
