Cargando…
Document retrieval on repetitive string collections
Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude...
Autores principales: | , , , , , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Springer Netherlands
2017
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5445192/ https://www.ncbi.nlm.nih.gov/pubmed/28596702 http://dx.doi.org/10.1007/s10791-017-9297-7 |
_version_ | 1783238833823809536 |
---|---|
author | Gagie, Travis Hartikainen, Aleksi Karhu, Kalle Kärkkäinen, Juha Navarro, Gonzalo Puglisi, Simon J. Sirén, Jouni |
author_facet | Gagie, Travis Hartikainen, Aleksi Karhu, Kalle Kärkkäinen, Juha Navarro, Gonzalo Puglisi, Simon J. Sirén, Jouni |
author_sort | Gagie, Travis |
collection | PubMed |
description | Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple [Formula: see text] model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case. |
format | Online Article Text |
id | pubmed-5445192 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Springer Netherlands |
record_format | MEDLINE/PubMed |
spelling | pubmed-54451922017-06-06 Document retrieval on repetitive string collections Gagie, Travis Hartikainen, Aleksi Karhu, Kalle Kärkkäinen, Juha Navarro, Gonzalo Puglisi, Simon J. Sirén, Jouni Inf Retr Boston Information Retrieval Efficiency Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple [Formula: see text] model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case. Springer Netherlands 2017-04-01 2017 /pmc/articles/PMC5445192/ /pubmed/28596702 http://dx.doi.org/10.1007/s10791-017-9297-7 Text en © The Author(s) 2017 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. |
spellingShingle | Information Retrieval Efficiency Gagie, Travis Hartikainen, Aleksi Karhu, Kalle Kärkkäinen, Juha Navarro, Gonzalo Puglisi, Simon J. Sirén, Jouni Document retrieval on repetitive string collections |
title | Document retrieval on repetitive string collections |
title_full | Document retrieval on repetitive string collections |
title_fullStr | Document retrieval on repetitive string collections |
title_full_unstemmed | Document retrieval on repetitive string collections |
title_short | Document retrieval on repetitive string collections |
title_sort | document retrieval on repetitive string collections |
topic | Information Retrieval Efficiency |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5445192/ https://www.ncbi.nlm.nih.gov/pubmed/28596702 http://dx.doi.org/10.1007/s10791-017-9297-7 |
work_keys_str_mv | AT gagietravis documentretrievalonrepetitivestringcollections AT hartikainenaleksi documentretrievalonrepetitivestringcollections AT karhukalle documentretrievalonrepetitivestringcollections AT karkkainenjuha documentretrievalonrepetitivestringcollections AT navarrogonzalo documentretrievalonrepetitivestringcollections AT puglisisimonj documentretrievalonrepetitivestringcollections AT sirenjouni documentretrievalonrepetitivestringcollections |