کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
396472 670352 2016 36 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Practical compressed string dictionaries
ترجمه فارسی عنوان
دیکشنری های رشته فشرده عملی
کلمات کلیدی
واژهنامههای رشته فشرده، پردازش متن، پایگاه داده های متن ساختار داده فشرده
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر هوش مصنوعی
چکیده انگلیسی


• We address the problem of managing string dictionaries in compressed space.
• We combine data structures and compression to propose several competitive solutions.
• Our approaches usually outperform the state-of-the-art techniques on real-world dictionaries.
• All our techniques are implemented and released in a C++ library hosted at GitHub.

The need to store and query a set of strings – a string dictionary – arises in many kinds of applications. While classically these string dictionaries have accounted for a small share of the total space budget (e.g., in Natural Language Processing or when indexing text collections), recent applications in Web engines, Semantic Web (RDF) graphs, Bioinformatics, and many others handle very large string dictionaries, whose size is a significant fraction of the whole data. In these cases, string dictionary management is a scalability issue by itself. This paper focuses on the problem of managing large static string dictionaries in compressed main memory space. We revisit classical solutions for string dictionaries like hashing, tries, and front-coding, and improve them by using compression techniques. We also introduce some novel string dictionary representations built on top of recent advances in succinct data structures and full-text indexes. All these structures are empirically compared on a heterogeneous testbed formed by real-world string dictionaries. We show that the compressed representations may use as little as 5% of the original dictionary size, while supporting lookup operations within a few microseconds. These numbers outperform the state-of-the-art space/time tradeoffs in many cases. Furthermore, we enhance some representations to provide prefix- and substring-based searches, which also perform competitively. The results show that compressed string dictionaries are a useful building block for various data-intensive applications in different domains.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Information Systems - Volume 56, March 2016, Pages 73–108
نویسندگان
, , , , ,