Article ID Journal Published Year Pages File Type
431157 Journal of Discrete Algorithms 2007 11 Pages PDF
Abstract

Given a string S over a finite alphabet Σ, the character set (also called the fingerprint  ) of a substring S′S′ of S   is the subset C⊆ΣC⊆Σ of the symbols occurring in S′S′. The study of the character sets of all the substrings of a given string (or a given collection of strings) appears in several domains such as rule induction for natural language processing or comparative genomics. Several computational problems concerning the character sets of a string arise from these applications, especially:(1)Output all the maximal locations of substrings having a given character set.(2)Output for each character set C occurring in a given string (or a given collection of strings) all the maximal locations of C.Denoting by n   the total length of the considered string or collection of strings, we solve the first problem in Θ(n)Θ(n) time using Θ(n)Θ(n) space. We present two algorithms solving the second problem. The first one runs in Θ(n2)Θ(n2) time using Θ(n)Θ(n) space. The second algorithm has Θ(n|Σ|log|Σ|)Θ(n|Σ|log|Σ|) time and Θ(n)Θ(n) space complexity and is an adaptation of an algorithm by Amir et al. [A. Amir, A. Apostolico, G.M. Landau, G. Satta, Efficient text fingerprinting via Parikh mapping, J. Discrete Algorithms 26 (2003) 1–13].

Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics
Authors
, , , ,