
Data structures based on k-mers for querying large collections of sequencing data sets

High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches; yet such a feature would be highl...


Bibliographic Details
Main authors:	Marchet, Camille; Boucher, Christina; Puglisi, Simon J.; Medvedev, Paul; Salson, Mikaël; Chikhi, Rayan
Format:	Online Article Text
Language:	English
Published:	Cold Spring Harbor Laboratory Press 2021
Subjects:	Review
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7849385/ https://www.ncbi.nlm.nih.gov/pubmed/33328168 http://dx.doi.org/10.1101/gr.260604.119

_version_	1783645290986733568
author	Marchet, Camille Boucher, Christina Puglisi, Simon J. Medvedev, Paul Salson, Mikaël Chikhi, Rayan
author_facet	Marchet, Camille Boucher, Christina Puglisi, Simon J. Medvedev, Paul Salson, Mikaël Chikhi, Rayan
author_sort	Marchet, Camille
collection	PubMed
description	High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches; yet such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
format	Online Article Text
id	pubmed-7849385
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Cold Spring Harbor Laboratory Press
record_format	MEDLINE/PubMed
spelling	pubmed-78493852021-07-01 Data structures based on k-mers for querying large collections of sequencing data sets Marchet, Camille Boucher, Christina Puglisi, Simon J. Medvedev, Paul Salson, Mikaël Chikhi, Rayan Genome Res Review High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations. Cold Spring Harbor Laboratory Press 2021-01 /pmc/articles/PMC7849385/ /pubmed/33328168 http://dx.doi.org/10.1101/gr.260604.119 Text en © 2021 Marchet et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
spellingShingle	Review Marchet, Camille Boucher, Christina Puglisi, Simon J. Medvedev, Paul Salson, Mikaël Chikhi, Rayan Data structures based on k-mers for querying large collections of sequencing data sets
title	Data structures based on k-mers for querying large collections of sequencing data sets
topic	Review
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7849385/ https://www.ncbi.nlm.nih.gov/pubmed/33328168 http://dx.doi.org/10.1101/gr.260604.119
