Approximate all-pairs suffix/prefix overlaps

Article ID	Journal	Published Year	Pages	File Type
426623	Information and Computation	2012	10 Pages	PDF

Abstract

Finding approximate overlaps is the first phase of many sequence assembly methods. Given a set of strings of total length n and an error-rate ϵ, the goal is to find, for all-pairs of strings, their suffix/prefix matches (overlaps) that are within edit distance k=⌈ϵℓ⌉, where ℓ is the length of the overlap. We propose a new solution for this problem based on backward backtracking (Lam, et al., 2008) and suffix filters (Kärkkäinen and Na, 2008). Our technique uses bits of space, where Hk is the k-th order entropy and σ the alphabet size. In practice, it is more scalable in terms of space, and comparable in terms of time, than q-gram filters (Rasmussen, et al., 2006). Our method is also easy to parallelize and scales up to millions of DNA reads.