Speeding up transposition-invariant string matching

Article ID	Journal	Published Year	Pages	File Type
429260	Information Processing Letters	2006	7 Pages	PDF

Abstract

Finding the longest common subsequence (LCS) of two given sequences A=a0a1…am−1 and B=b0b1…bn−1 is an important and well studied problem. We consider its generalization, transposition-invariant LCS (LCTS), which has recently arisen in the field of music information retrieval. In LCTS, we look for the LCS between the sequences A+t=(a0+t)(a1+t)…(am−1+t) and B where t is any integer. We introduce a family of algorithms (motivated by the Hunt–Szymanski scheme for LCS), improving the currently best known complexity from O(mnloglogσ) to O(Dloglogσ+mn), where σ is the alphabet size and D⩽mn is the total number of dominant matches for all transpositions. Then, we demonstrate experimentally that some of our algorithms outperform the best ones from literature.