Cargando…

Document retrieval on repetitive string collections

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude...

Descripción completa

Detalles Bibliográficos
Autores principales: Gagie, Travis, Hartikainen, Aleksi, Karhu, Kalle, Kärkkäinen, Juha, Navarro, Gonzalo, Puglisi, Simon J., Sirén, Jouni
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Springer Netherlands 2017
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5445192/
https://www.ncbi.nlm.nih.gov/pubmed/28596702
http://dx.doi.org/10.1007/s10791-017-9297-7
_version_ 1783238833823809536
author Gagie, Travis
Hartikainen, Aleksi
Karhu, Kalle
Kärkkäinen, Juha
Navarro, Gonzalo
Puglisi, Simon J.
Sirén, Jouni
author_facet Gagie, Travis
Hartikainen, Aleksi
Karhu, Kalle
Kärkkäinen, Juha
Navarro, Gonzalo
Puglisi, Simon J.
Sirén, Jouni
author_sort Gagie, Travis
collection PubMed
description Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple [Formula: see text] model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
format Online
Article
Text
id pubmed-5445192
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Springer Netherlands
record_format MEDLINE/PubMed
spelling pubmed-54451922017-06-06 Document retrieval on repetitive string collections Gagie, Travis Hartikainen, Aleksi Karhu, Kalle Kärkkäinen, Juha Navarro, Gonzalo Puglisi, Simon J. Sirén, Jouni Inf Retr Boston Information Retrieval Efficiency Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple [Formula: see text] model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case. Springer Netherlands 2017-04-01 2017 /pmc/articles/PMC5445192/ /pubmed/28596702 http://dx.doi.org/10.1007/s10791-017-9297-7 Text en © The Author(s) 2017 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
spellingShingle Information Retrieval Efficiency
Gagie, Travis
Hartikainen, Aleksi
Karhu, Kalle
Kärkkäinen, Juha
Navarro, Gonzalo
Puglisi, Simon J.
Sirén, Jouni
Document retrieval on repetitive string collections
title Document retrieval on repetitive string collections
title_full Document retrieval on repetitive string collections
title_fullStr Document retrieval on repetitive string collections
title_full_unstemmed Document retrieval on repetitive string collections
title_short Document retrieval on repetitive string collections
title_sort document retrieval on repetitive string collections
topic Information Retrieval Efficiency
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5445192/
https://www.ncbi.nlm.nih.gov/pubmed/28596702
http://dx.doi.org/10.1007/s10791-017-9297-7
work_keys_str_mv AT gagietravis documentretrievalonrepetitivestringcollections
AT hartikainenaleksi documentretrievalonrepetitivestringcollections
AT karhukalle documentretrievalonrepetitivestringcollections
AT karkkainenjuha documentretrievalonrepetitivestringcollections
AT navarrogonzalo documentretrievalonrepetitivestringcollections
AT puglisisimonj documentretrievalonrepetitivestringcollections
AT sirenjouni documentretrievalonrepetitivestringcollections