Efficient computation of shortest absent words in complete genomes

Article ID	Journal	Published Year	Pages	File Type
6856665	Information Sciences	2018	10 Pages	PDF

Abstract

Computation of shortest absent words in genomes of organisms is a recent alignment free technique to compare genomes of species and to make functional inferences. This technique offers a good indication to mutations, phylogenetic tree construction, genome reconstruction, drug target identification, and pesticide development. The traditional alignments techniques are computationally expensive and impractical on complete and large genomes. Currently, there are two solutions exist to solve this problem on genome level of large sizes, stochastic-based method is running in time complexity of O(n), where the shortest length of absent words is approximated and entered manually and the other is standalone method costs O(nlog2âlog2(n+1)/2â)-time, where the shortest length is determined automatically by the algorithm itself. In this paper, we present standalone method to compute the shortest absent words in time complexity of O(n). The proposed method produces the complete set of shortest absent words from the two strands of the whole human genome in 3.5Â min. Our method scans the genomes twice. In the first round, the upper bound for the shortest length is computed and in the second round the absent words are identified. The method is efficient and easy to use; it requires only the genome as input parameter.

Keywords

Algorithms Sequence analysis Complexity analysis