Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
10355238 | Information Processing & Management | 2005 | 10 Pages |
Abstract
This paper describes an extensible, open-source (GPL) data repository and retrieval system that supports fast, efficient, keyword-based retrieval of genomic sequences from multiple libraries, with retrieved sequences post-processed by FASTA, Smith-Waterman, and other analysis software. The application is implemented for Linux and is written in Mumps, C, and C++, with supporting components that include the Berkeley Data Base, the Perl Compatible Regular Expression Library, GLADE, and tools such as FASTA, Smith-Waterman, and modules from EMBOSS. The package can quickly index data sets of up to 256 terabytes using a B-tree-based multi-dimensional data model. An example is presented that indexes the text of the full NCBI GenBank library.
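The keyword index over a multi-dimensional data model mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Mumps-style sparse array whose subscripts are (keyword, library, accession), and it substitutes an in-memory sorted list of tuple keys for the on-disk B-tree. The class `SparseGlobal`, the helpers `index_record` and `lookup`, and the sample records are all hypothetical.

```python
import bisect
import re


class SparseGlobal:
    """Toy stand-in for a Mumps-style global: ordered tuple subscripts -> value.

    Assumption: a sorted Python list replaces the B-tree; both keep keys in
    order, which is what makes prefix traversal possible.
    """

    def __init__(self):
        self._keys = []    # sorted list of subscript tuples
        self._values = {}  # subscript tuple -> stored value

    def set(self, subscripts, value=""):
        key = tuple(subscripts)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys in sorted order
        self._values[key] = value

    def children(self, prefix):
        """Yield the distinct next-level subscripts under prefix, in key order."""
        prefix = tuple(prefix)
        depth = len(prefix)
        i = bisect.bisect_left(self._keys, prefix)
        last = None
        while i < len(self._keys) and self._keys[i][:depth] == prefix:
            key = self._keys[i]
            if len(key) > depth and key[depth] != last:
                last = key[depth]
                yield last
            i += 1


def index_record(index, library, accession, description):
    """Store one (keyword, library, accession) entry per distinct word."""
    for word in set(re.findall(r"[a-z0-9]+", description.lower())):
        index.set((word, library, accession))


def lookup(index, keyword):
    """Return (library, accession) pairs whose description contains keyword."""
    keyword = keyword.lower()
    return [
        (library, accession)
        for library in index.children((keyword,))
        for accession in index.children((keyword, library))
    ]


# Hypothetical records; the library names and accession IDs are made up.
index = SparseGlobal()
index_record(index, "genbank", "X0001", "Homo sapiens protein kinase mRNA")
index_record(index, "genbank", "X0002", "Mus musculus kinase inhibitor gene")
index_record(index, "embl", "Y0001", "Yeast protein phosphatase")

kinase_hits = lookup(index, "kinase")
protein_hits = lookup(index, "protein")
```

Because the keys are kept in sorted order, every entry for a keyword is contiguous, so a lookup is one positioned search followed by a short sequential scan. A B-tree provides the same access pattern on disk.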
Related Topics
Physical Sciences and Engineering
Computer Science
Computer Science Applications
Authors
Kevin C. O'Kane, Matthew J. Lockner