Cargando…

MergedTrie: Efficient textual indexing

The accessing and processing of textual information (i.e. the storing and querying of a set of strings) is especially important for many current applications (e.g. information retrieval and social networks), especially when working in the fields of Big Data or IoT, which require the handling of very...

Descripción completa

Detalles Bibliográficos
Autores principales:	Ferrández, Antonio, Peral, Jesús
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Public Library of Science 2019
Materias:	Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6478299/ https://www.ncbi.nlm.nih.gov/pubmed/31013282 http://dx.doi.org/10.1371/journal.pone.0215288

_version_	1783413147771600896
author	Ferrández, Antonio Peral, Jesús
author_facet	Ferrández, Antonio Peral, Jesús
author_sort	Ferrández, Antonio
collection	PubMed
description	The accessing and processing of textual information (i.e. the storing and querying of a set of strings) is especially important for many current applications (e.g. information retrieval and social networks), especially when working in the fields of Big Data or IoT, which require the handling of very large string dictionaries. Typical data structures for textual indexing are Hash Tables and some variants of Tries such as the Double Trie (DT). In this paper, we propose an extension of the DT that we have called MergedTrie. It improves the DT compression by merging both Tries into a single and by segmenting the indexed term into two fixed length parts in order to balance the new Trie. Thus, a higher overlapping of both prefixes and suffixes is obtained. Moreover, we propose a new implementation of Tries that achieves better compression rates than the Double-Array representation usually chosen for implementing Tries. Our proposal also overcomes the limitation of static implementations that does not allow insertions and updates in their compact representations. Finally, our MergedTrie implementation experimentally improves the efficiency of the Hash Tables, the DTs, the Double-Array, the Crit-bit, the Directed Acyclic Word Graphs (DAWG), and the Acyclic Deterministic Finite Automata (ADFA) data structures, requiring less space than the original text to be indexed.
format	Online Article Text
id	pubmed-6478299
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-64782992019-05-07 MergedTrie: Efficient textual indexing Ferrández, Antonio Peral, Jesús PLoS One Research Article The accessing and processing of textual information (i.e. the storing and querying of a set of strings) is especially important for many current applications (e.g. information retrieval and social networks), especially when working in the fields of Big Data or IoT, which require the handling of very large string dictionaries. Typical data structures for textual indexing are Hash Tables and some variants of Tries such as the Double Trie (DT). In this paper, we propose an extension of the DT that we have called MergedTrie. It improves the DT compression by merging both Tries into a single and by segmenting the indexed term into two fixed length parts in order to balance the new Trie. Thus, a higher overlapping of both prefixes and suffixes is obtained. Moreover, we propose a new implementation of Tries that achieves better compression rates than the Double-Array representation usually chosen for implementing Tries. Our proposal also overcomes the limitation of static implementations that does not allow insertions and updates in their compact representations. Finally, our MergedTrie implementation experimentally improves the efficiency of the Hash Tables, the DTs, the Double-Array, the Crit-bit, the Directed Acyclic Word Graphs (DAWG), and the Acyclic Deterministic Finite Automata (ADFA) data structures, requiring less space than the original text to be indexed. Public Library of Science 2019-04-23 /pmc/articles/PMC6478299/ /pubmed/31013282 http://dx.doi.org/10.1371/journal.pone.0215288 Text en © 2019 Ferrández, Peral http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Ferrández, Antonio Peral, Jesús MergedTrie: Efficient textual indexing
title	MergedTrie: Efficient textual indexing
title_full	MergedTrie: Efficient textual indexing
title_fullStr	MergedTrie: Efficient textual indexing
title_full_unstemmed	MergedTrie: Efficient textual indexing
title_short	MergedTrie: Efficient textual indexing
title_sort	mergedtrie: efficient textual indexing
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6478299/ https://www.ncbi.nlm.nih.gov/pubmed/31013282 http://dx.doi.org/10.1371/journal.pone.0215288
work_keys_str_mv	AT ferrandezantonio mergedtrieefficienttextualindexing AT peraljesus mergedtrieefficienttextualindexing

MergedTrie: Efficient textual indexing

Ejemplares similares