Data structures based on k-mers for querying large collections of sequencing data sets
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highl...
Main authors: | Marchet, Camille; Boucher, Christina; Puglisi, Simon J.; Medvedev, Paul; Salson, Mikaël; Chikhi, Rayan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory Press, 2021 |
Subjects: | Review |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7849385/ https://www.ncbi.nlm.nih.gov/pubmed/33328168 http://dx.doi.org/10.1101/gr.260604.119 |
_version_ | 1783645290986733568 |
---|---|
author | Marchet, Camille Boucher, Christina Puglisi, Simon J. Medvedev, Paul Salson, Mikaël Chikhi, Rayan |
author_facet | Marchet, Camille Boucher, Christina Puglisi, Simon J. Medvedev, Paul Salson, Mikaël Chikhi, Rayan |
author_sort | Marchet, Camille |
collection | PubMed |
description | High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations. |
format | Online Article Text |
id | pubmed-7849385 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Cold Spring Harbor Laboratory Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-78493852021-07-01 Data structures based on k-mers for querying large collections of sequencing data sets Marchet, Camille Boucher, Christina Puglisi, Simon J. Medvedev, Paul Salson, Mikaël Chikhi, Rayan Genome Res Review High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations. Cold Spring Harbor Laboratory Press 2021-01 /pmc/articles/PMC7849385/ /pubmed/33328168 http://dx.doi.org/10.1101/gr.260604.119 Text en © 2021 Marchet et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/. |
spellingShingle | Review Marchet, Camille Boucher, Christina Puglisi, Simon J. Medvedev, Paul Salson, Mikaël Chikhi, Rayan Data structures based on k-mers for querying large collections of sequencing data sets |
title | Data structures based on k-mers for querying large collections of sequencing data sets |
title_full | Data structures based on k-mers for querying large collections of sequencing data sets |
title_fullStr | Data structures based on k-mers for querying large collections of sequencing data sets |
title_full_unstemmed | Data structures based on k-mers for querying large collections of sequencing data sets |
title_short | Data structures based on k-mers for querying large collections of sequencing data sets |
title_sort | data structures based on k-mers for querying large collections of sequencing data sets |
topic | Review |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7849385/ https://www.ncbi.nlm.nih.gov/pubmed/33328168 http://dx.doi.org/10.1101/gr.260604.119 |