Article ID Journal Published Year Pages File Type
436477 Theoretical Computer Science 2014 19 Pages PDF
Abstract

In this paper we consider the problem of similar substring searching in the q-gram distance. The q  -gram distance dq(x,y)dq(x,y) is a similarity measure between two strings x and y defined by the number of different q  -grams between them. The distance can be used instead of the edit distance due to its lower computation cost, O(|x|+|y|)O(|x|+|y|) vs. O(|x||y|)O(|x||y|), and its good approximation for the edit distance. However, if this distance is applied to the problem of finding all similar strings, in a long text t, to a given pattern p, the total computation cost is sometimes not acceptable. Ukkonen already proposed two fast algorithms: one with an array and the other with a tree. When “similar” means k   or less in dqdq, their time complexities are O(|t|k+|p|)O(|t|k+|p|) and O(|t|logk+|p|), respectively. In this paper, we propose two algorithms of average-case complexity O(|t|+|p|)O(|t|+|p|), although their worst-case complexities are still O(|t|k+|p|)O(|t|k+|p|) and O(|t|logk+|p|), respectively. The linearity of the average-case complexity is analyzed under the assumption of random sampling of t and the condition that q is larger than a threshold. The algorithms exploit the fact that similar substrings in t   are often found at very close positions if the beginning positions of the substrings are close. In the second proposed algorithm, we adopted a doubly-linked list supported by an array and a search tree to search for a list element in O(logk) time. Experimental results support their theoretical average-case complexities.

Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics
Authors
, , ,