Cargando…

Document retrieval on repetitive string collections

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude...

Descripción completa

Detalles Bibliográficos
Autores principales:	Gagie, Travis, Hartikainen, Aleksi, Karhu, Kalle, Kärkkäinen, Juha, Navarro, Gonzalo, Puglisi, Simon J., Sirén, Jouni
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Springer Netherlands 2017
Materias:	Information Retrieval Efficiency
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5445192/ https://www.ncbi.nlm.nih.gov/pubmed/28596702 http://dx.doi.org/10.1007/s10791-017-9297-7

_version_	1783238833823809536
author	Gagie, Travis Hartikainen, Aleksi Karhu, Kalle Kärkkäinen, Juha Navarro, Gonzalo Puglisi, Simon J. Sirén, Jouni
author_facet	Gagie, Travis Hartikainen, Aleksi Karhu, Kalle Kärkkäinen, Juha Navarro, Gonzalo Puglisi, Simon J. Sirén, Jouni
author_sort	Gagie, Travis
collection	PubMed
description	Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple [Formula: see text] model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
format	Online Article Text
id	pubmed-5445192
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Springer Netherlands
record_format	MEDLINE/PubMed
spelling	pubmed-54451922017-06-06 Document retrieval on repetitive string collections Gagie, Travis Hartikainen, Aleksi Karhu, Kalle Kärkkäinen, Juha Navarro, Gonzalo Puglisi, Simon J. Sirén, Jouni Inf Retr Boston Information Retrieval Efficiency Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple [Formula: see text] model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case. Springer Netherlands 2017-04-01 2017 /pmc/articles/PMC5445192/ /pubmed/28596702 http://dx.doi.org/10.1007/s10791-017-9297-7 Text en © The Author(s) 2017 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
spellingShingle	Information Retrieval Efficiency Gagie, Travis Hartikainen, Aleksi Karhu, Kalle Kärkkäinen, Juha Navarro, Gonzalo Puglisi, Simon J. Sirén, Jouni Document retrieval on repetitive string collections
title	Document retrieval on repetitive string collections
title_full	Document retrieval on repetitive string collections
title_fullStr	Document retrieval on repetitive string collections
title_full_unstemmed	Document retrieval on repetitive string collections
title_short	Document retrieval on repetitive string collections
title_sort	document retrieval on repetitive string collections
topic	Information Retrieval Efficiency
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5445192/ https://www.ncbi.nlm.nih.gov/pubmed/28596702 http://dx.doi.org/10.1007/s10791-017-9297-7
work_keys_str_mv	AT gagietravis documentretrievalonrepetitivestringcollections AT hartikainenaleksi documentretrievalonrepetitivestringcollections AT karhukalle documentretrievalonrepetitivestringcollections AT karkkainenjuha documentretrievalonrepetitivestringcollections AT navarrogonzalo documentretrievalonrepetitivestringcollections AT puglisisimonj documentretrievalonrepetitivestringcollections AT sirenjouni documentretrievalonrepetitivestringcollections

Document retrieval on repetitive string collections

Ejemplares similares