Average-case linear-time similar substring searching by the q-gram distance

Article ID	Journal	Published Year	Pages	File Type
436477	Theoretical Computer Science	2014	19 Pages	PDF

Abstract

In this paper we consider the problem of similar substring searching in the q-gram distance. The q -gram distance dq(x,y)dq(x,y) is a similarity measure between two strings x and y defined by the number of different q -grams between them. The distance can be used instead of the edit distance due to its lower computation cost, O(|x|+|y|)O(|x|+|y|) vs. O(|x||y|)O(|x||y|), and its good approximation for the edit distance. However, if this distance is applied to the problem of finding all similar strings, in a long text t, to a given pattern p, the total computation cost is sometimes not acceptable. Ukkonen already proposed two fast algorithms: one with an array and the other with a tree. When “similar” means k or less in dqdq, their time complexities are O(|t|k+|p|)O(|t|k+|p|) and O(|t|logk+|p|), respectively. In this paper, we propose two algorithms of average-case complexity O(|t|+|p|)O(|t|+|p|), although their worst-case complexities are still O(|t|k+|p|)O(|t|k+|p|) and O(|t|logk+|p|), respectively. The linearity of the average-case complexity is analyzed under the assumption of random sampling of t and the condition that q is larger than a threshold. The algorithms exploit the fact that similar substrings in t are often found at very close positions if the beginning positions of the substrings are close. In the second proposed algorithm, we adopted a doubly-linked list supported by an array and a search tree to search for a list element in O(logk) time. Experimental results support their theoretical average-case complexities.

Keywords

Approximate string matching String searching