Data structures based on k-mers for querying large collections of sequencing data sets

High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highl...

Bibliographic Details
Main authors: Marchet, Camille, Boucher, Christina, Puglisi, Simon J., Medvedev, Paul, Salson, Mikaël, Chikhi, Rayan
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7849385/
https://www.ncbi.nlm.nih.gov/pubmed/33328168
http://dx.doi.org/10.1101/gr.260604.119
author Marchet, Camille
Boucher, Christina
Puglisi, Simon J.
Medvedev, Paul
Salson, Mikaël
Chikhi, Rayan
collection PubMed
description High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
format Online
Article
Text
id pubmed-7849385
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-78493852021-07-01 Data structures based on k-mers for querying large collections of sequencing data sets Marchet, Camille Boucher, Christina Puglisi, Simon J. Medvedev, Paul Salson, Mikaël Chikhi, Rayan Genome Res Review High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations. Cold Spring Harbor Laboratory Press 2021-01 /pmc/articles/PMC7849385/ /pubmed/33328168 http://dx.doi.org/10.1101/gr.260604.119 Text en © 2021 Marchet et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
title Data structures based on k-mers for querying large collections of sequencing data sets
topic Review