Article ID Journal Published Year Pages File Type
6856665 Information Sciences 2018 10 Pages PDF
Abstract
Computation of shortest absent words in genomes of organisms is a recent alignment free technique to compare genomes of species and to make functional inferences. This technique offers a good indication to mutations, phylogenetic tree construction, genome reconstruction, drug target identification, and pesticide development. The traditional alignments techniques are computationally expensive and impractical on complete and large genomes. Currently, there are two solutions exist to solve this problem on genome level of large sizes, stochastic-based method is running in time complexity of O(n), where the shortest length of absent words is approximated and entered manually and the other is standalone method costs O(nlog2⌈log2(n+1)/2⌉)-time, where the shortest length is determined automatically by the algorithm itself. In this paper, we present standalone method to compute the shortest absent words in time complexity of O(n). The proposed method produces the complete set of shortest absent words from the two strands of the whole human genome in 3.5 min. Our method scans the genomes twice. In the first round, the upper bound for the shortest length is computed and in the second round the absent words are identified. The method is efficient and easy to use; it requires only the genome as input parameter.
Related Topics
Physical Sciences and Engineering Computer Science Artificial Intelligence
Authors
,