کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
430864 688218 2013 18 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Various improvements to text fingerprinting
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر نظریه محاسباتی و ریاضیات
پیش نمایش صفحه اول مقاله
Various improvements to text fingerprinting
چکیده انگلیسی

Let s=s1..sns=s1..sn be a text (or sequence) on a finite alphabet Σ of size σ. A fingerprint in s   is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set FF of all fingerprints of all substrings of s   in order to answer efficiently certain questions on this set. A substring si..sjsi..sj is a maximal location for a fingerprint f∈Ff∈F (denoted by 〈i,j〉〈i,j〉) if the alphabet of si..sjsi..sj is f   and si−1si−1, sj+1sj+1, if defined, are not in f. The set of maximal locations in s   is LL (it is easy to see that |L|⩽nσ|L|⩽nσ). Two maximal locations 〈i,j〉〈i,j〉 and 〈k,l〉〈k,l〉 such that si..sj=sk..slsi..sj=sk..sl are named copies  , and the quotient set of LL according to the copy relation is denoted by LCLC.We first present new exact efficient algorithms and data structures for the following three problems: (1) to compute FF; (2) given f as a set of distinct characters in Σ, to answer if f   represents a fingerprint in FF; (3) given f, to find all maximal locations of f in s  . As well as in papers concerning succinct data structures, in the paper all space complexities are counted in bits. Problem 1 is solved either in O(n+|LC|logσ) worst-case time (in this paper all logarithms are intended as base two logarithms) using O((n+|LC|+|F|logσ)logn) bits of space, or in O(n+|L|logσ) randomized expected time using O((n+|F|logσ)logn) bits of space. Problem 2 is solved either in O(|f|)O(|f|) expected time if only O(|f|logn) bits of working space for queries is allowed, or in worst-case O(|f|/ϵ)O(|f|/ϵ) time if a working space of O(σϵlogn) bits is allowed (with ϵ   a constant satisfying 0<ϵ<10<ϵ<1). These algorithms use a data structure that occupies |F|(2logσ+log2e)(1+o(1)) bits. Problem 3 is solved with the same time complexity as Problem 2, but with the addition of an occ term to each of the complexities, where occ   is the number of maximal locations corresponding to the given fingerprint. Our solution of this last problem requires a data structure that occupies O((n+|LC|)logn) bits of memory.In the second part of our paper we present a novel Monte Carlo approximate construction approach. Problem 1 is thus solved in O(n+|L|)O(n+|L|) expected time using O(|F|logn) bits of space but the algorithm is incorrect with an extremely small probability that can be bounded in advance.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Journal of Discrete Algorithms - Volume 22, September 2013, Pages 1–18
نویسندگان
, , ,