Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
15227 | Computational Biology and Chemistry | 2011 | 8 Pages |
De novo sequence assembly is a ubiquitous combinatorial problem in all DNA sequencing technologies. In the presence of errors in the experimental data, the assembly problem is computationally challenging, and its solution may not lead to a unique reconstruct. The enumeration of all alternative solutions is important in drawing a reliable conclusion on the target sequence, and is often overlooked in the heuristic approaches that are currently available. In this paper, we develop an integer programming formulation and global optimization solution strategy to solve the sequence assembly problem with errors in the data. We also propose an efficient technique to identify all alternative reconstructs. When applied to examples of sequencing-by-hybridization, our approach dramatically increases the length of DNA sequences that can be handled with global optimality certificate to over 10,000, which is more than 10 times longer than previously reported. For some problem instances, alternative solutions exhibited a wide range of different ability in reproducing the target DNA sequence. Therefore, it is important to utilize the methodology proposed in this paper in order to obtain all alternative solutions to reliably infer the true reconstruct. These alternative solutions can be used to refine the obtained results and guide the design of further experiments to correctly reconstruct the target DNA sequence.
Graphical abstractFigure optionsDownload full-size imageDownload as PowerPoint slideHighlights► We developed a novel integer programming framework for DNA sequence assembly. ► The formulation effectively identifies biologically meaningful reconstructs. ► The strategy provides all alternative global optima. ► We demonstrate that alternative optima are critical for correct reconstruction. ► The approach reconstructs 10 times longer DNA sequences than previously reported.