کد مقاله | کد نشریه | سال انتشار | مقاله انگلیسی | نسخه تمام متن |
---|---|---|---|---|
436477 | 690008 | 2014 | 19 صفحه PDF | دانلود رایگان |

In this paper we consider the problem of similar substring searching in the q-gram distance. The q -gram distance dq(x,y)dq(x,y) is a similarity measure between two strings x and y defined by the number of different q -grams between them. The distance can be used instead of the edit distance due to its lower computation cost, O(|x|+|y|)O(|x|+|y|) vs. O(|x||y|)O(|x||y|), and its good approximation for the edit distance. However, if this distance is applied to the problem of finding all similar strings, in a long text t, to a given pattern p, the total computation cost is sometimes not acceptable. Ukkonen already proposed two fast algorithms: one with an array and the other with a tree. When “similar” means k or less in dqdq, their time complexities are O(|t|k+|p|)O(|t|k+|p|) and O(|t|logk+|p|), respectively. In this paper, we propose two algorithms of average-case complexity O(|t|+|p|)O(|t|+|p|), although their worst-case complexities are still O(|t|k+|p|)O(|t|k+|p|) and O(|t|logk+|p|), respectively. The linearity of the average-case complexity is analyzed under the assumption of random sampling of t and the condition that q is larger than a threshold. The algorithms exploit the fact that similar substrings in t are often found at very close positions if the beginning positions of the substrings are close. In the second proposed algorithm, we adopted a doubly-linked list supported by an array and a search tree to search for a list element in O(logk) time. Experimental results support their theoretical average-case complexities.
Journal: Theoretical Computer Science - Volume 530, 17 April 2014, Pages 23–41