Average-case linear-time similar substring searching by the q-gram distance

کد مقاله	کد نشریه	سال انتشار	مقاله انگلیسی	نسخه تمام متن
436477	690008	2014	19 صفحه PDF	دانلود رایگان

عنوان انگلیسی مقاله ISI

دانلود مقاله + سفارش ترجمه

دانلود مقاله ISI انگلیسی

رایگان برای ایرانیان

کلمات کلیدی

Approximate string matching - تطابق رشته تقریبی String searching - رشته جستجو

موضوعات مرتبط

مهندسی و علوم پایه مهندسی کامپیوتر نظریه محاسباتی و ریاضیات

پیش نمایش صفحه اول مقاله

Average-case linear-time similar substring searching by the q-gram distance

چکیده انگلیسی

In this paper we consider the problem of similar substring searching in the q-gram distance. The q -gram distance dq(x,y)dq(x,y) is a similarity measure between two strings x and y defined by the number of different q -grams between them. The distance can be used instead of the edit distance due to its lower computation cost, O(|x|+|y|)O(|x|+|y|) vs. O(|x||y|)O(|x||y|), and its good approximation for the edit distance. However, if this distance is applied to the problem of finding all similar strings, in a long text t, to a given pattern p, the total computation cost is sometimes not acceptable. Ukkonen already proposed two fast algorithms: one with an array and the other with a tree. When “similar” means k or less in dqdq, their time complexities are O(|t|k+|p|)O(|t|k+|p|) and O(|t|logk+|p|), respectively. In this paper, we propose two algorithms of average-case complexity O(|t|+|p|)O(|t|+|p|), although their worst-case complexities are still O(|t|k+|p|)O(|t|k+|p|) and O(|t|logk+|p|), respectively. The linearity of the average-case complexity is analyzed under the assumption of random sampling of t and the condition that q is larger than a threshold. The algorithms exploit the fact that similar substrings in t are often found at very close positions if the beginning positions of the substrings are close. In the second proposed algorithm, we adopted a doubly-linked list supported by an array and a search tree to search for a list element in O(logk) time. Experimental results support their theoretical average-case complexities.

ناشر

Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Theoretical Computer Science - Volume 530, 17 April 2014, Pages 23–41

نویسندگان

Hiroyuki Hanada, Mineichi Kudo, Atsuyoshi Nakamura,

علوم انسانی و هنر

فنی، مهندسی و علوم پایه

پزشکی و سلامت

بیو تکنولوژی

پذیرش سفارش ترجمه

Average-case linear-time similar substring searching by the q-gram distance

دسترسی سریع

ارتباط

English Website