کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
397492 671251 2011 11 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Suffix trees for inputs larger than main memory
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر هوش مصنوعی
پیش نمایش صفحه اول مقاله
Suffix trees for inputs larger than main memory
چکیده انگلیسی

A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, when it comes to the use of suffix trees in real-life applications, the current methods for constructing suffix trees do not scale for large inputs. As suffix trees are larger than the input sequences and quickly outgrow the main memory, the first attempts at building large suffix trees focused on algorithms which avoid massive random access to the trees being built. However, all the existing practical algorithms perform random access to the input string, thus requiring in essence that the input be small enough to be kept in main memory. The constantly growing pool of string data, especially biological sequences, requires us to build suffix trees for much larger strings.We are the first to present an algorithm which is able to construct suffix trees for input sequences significantly larger than the size of the available main memory. Both the input string and the suffix tree are kept on disk and the algorithm is designed to avoid multiple random I/Os to both of them.1 As a proof of concept, we show that our method allows to build the suffix tree for 12 GB of real DNA sequences in 26 h on a single machine with 2 GB of RAM. This input is four times the size of the Human Genome, and the construction of suffix trees for inputs of such magnitude was never reported before.

Research Highlights
► In this paper we introduce a new method for suffix tree construction using secondary storage, which avoids random access to the disk-based data structures.
► This is the first method that handles inputs exceeding the main memory without causing disk thrashing.
► We describe new layouts of the constructed suffix tree, which makes this index easily accessible without loading the entire tree into the main memory.
► The experimental evaluation presented here shows that the suffix tree construction using our new method is feasible for inputs, which are several times larger than the available main memory.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Information Systems - Volume 36, Issue 3, May 2011, Pages 644–654
نویسندگان
, , ,