Character sets of strings

Article ID	Journal	Published Year	Pages	File Type
431157	Journal of Discrete Algorithms	2007	11 Pages	PDF

Abstract

Given a string S over a finite alphabet Σ, the character set (also called the fingerprint ) of a substring S′S′ of S is the subset C⊆ΣC⊆Σ of the symbols occurring in S′S′. The study of the character sets of all the substrings of a given string (or a given collection of strings) appears in several domains such as rule induction for natural language processing or comparative genomics. Several computational problems concerning the character sets of a string arise from these applications, especially:(1)Output all the maximal locations of substrings having a given character set.(2)Output for each character set C occurring in a given string (or a given collection of strings) all the maximal locations of C.Denoting by n the total length of the considered string or collection of strings, we solve the first problem in Θ(n)Θ(n) time using Θ(n)Θ(n) space. We present two algorithms solving the second problem. The first one runs in Θ(n2)Θ(n2) time using Θ(n)Θ(n) space. The second algorithm has Θ(n|Σ|log|Σ|)Θ(n|Σ|log|Σ|) time and Θ(n)Θ(n) space complexity and is an adaptation of an algorithm by Amir et al. [A. Amir, A. Apostolico, G.M. Landau, G. Satta, Efficient text fingerprinting via Parikh mapping, J. Discrete Algorithms 26 (2003) 1–13].

Keywords

Character sets Fingerprints Combinatorial algorithms on words Natural Language Processing Comparative genomics