
REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

MOTIVATION: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large data...
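
As a rough illustration of what "indexing sequences and recording their abundances across a collection of datasets" means, the sketch below builds a naive dictionary that maps each k-mer to one count per dataset. It is not REINDEER's implementation. The function names (`kmers`, `build_index`, `query`), the toy reads, and the choice of k = 4 are all assumptions made for this example, and a plain Python dictionary stands in for the compact structures the actual tool relies on.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every substring of length k (k-mer) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(datasets, k):
    """Map each k-mer to a list of counts, one entry per dataset.

    `datasets` is a list of datasets; each dataset is a list of reads.
    Hypothetical helper for illustration only.
    """
    index = {}
    for d, reads in enumerate(datasets):
        counts = Counter(km for read in reads for km in kmers(read, k))
        for km, c in counts.items():
            # A k-mer unseen so far starts with zero counts everywhere.
            index.setdefault(km, [0] * len(datasets))[d] = c
    return index


def query(index, seq, k, n_datasets):
    """Return, for each k-mer of seq, its abundance in every dataset.

    A k-mer missing from the index is reported with all-zero counts,
    which also answers the presence/absence question.
    """
    return {km: index.get(km, [0] * n_datasets) for km in kmers(seq, k)}


# Toy input: two "datasets" of short reads (made up for this example).
datasets = [
    ["ACGTAC", "CGTACG"],  # dataset 0
    ["ACGTTT"],            # dataset 1
]
index = build_index(datasets, k=4)
result = query(index, "ACGTA", k=4, n_datasets=len(datasets))
print(result)
```

A dictionary of this kind grows with the number of distinct k-mers times the number of datasets, so it only serves to make the query semantics concrete; it would not scale to the collection sizes the record describes.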


Bibliographic Details
Main Authors:	Marchet, Camille, Iqbal, Zamin, Gautheret, Daniel, Salson, Mikaël, Chikhi, Rayan
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Genomic Variation Analysis
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355249/ https://www.ncbi.nlm.nih.gov/pubmed/32657392 http://dx.doi.org/10.1093/bioinformatics/btaa487

_version_	1783558236686778368
author	Marchet, Camille Iqbal, Zamin Gautheret, Daniel Salson, Mikaël Chikhi, Rayan
author_facet	Marchet, Camille Iqbal, Zamin Gautheret, Daniel Salson, Mikaël Chikhi, Rayan
author_sort	Marchet, Camille
collection	PubMed
description	MOTIVATION: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. RESULTS: We used REINDEER to index the abundances of sequences within 2585 human RNA-seq experiments in 45 h using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ∼4 billion distinct k-mers across 2585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph of each dataset, then conceptually merges those de Bruijn graphs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances. AVAILABILITY AND IMPLEMENTATION: https://github.com/kamimrcht/REINDEER. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-7355249
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7355249 2020-07-16 REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets Marchet, Camille Iqbal, Zamin Gautheret, Daniel Salson, Mikaël Chikhi, Rayan Bioinformatics Genomic Variation Analysis MOTIVATION: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. RESULTS: We used REINDEER to index the abundances of sequences within 2585 human RNA-seq experiments in 45 h using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ∼4 billion distinct k-mers across 2585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph of each dataset, then conceptually merges those de Bruijn graphs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances. AVAILABILITY AND IMPLEMENTATION: https://github.com/kamimrcht/REINDEER. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2020-07 2020-07-13 /pmc/articles/PMC7355249/ /pubmed/32657392 http://dx.doi.org/10.1093/bioinformatics/btaa487 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle	Genomic Variation Analysis Marchet, Camille Iqbal, Zamin Gautheret, Daniel Salson, Mikaël Chikhi, Rayan REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
title	REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
title_full	REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
title_fullStr	REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
title_full_unstemmed	REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
title_short	REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
title_sort	reindeer: efficient indexing of k-mer presence and abundance in sequencing datasets
topic	Genomic Variation Analysis
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355249/ https://www.ncbi.nlm.nih.gov/pubmed/32657392 http://dx.doi.org/10.1093/bioinformatics/btaa487
work_keys_str_mv	AT marchetcamille reindeerefficientindexingofkmerpresenceandabundanceinsequencingdatasets AT iqbalzamin reindeerefficientindexingofkmerpresenceandabundanceinsequencingdatasets AT gautheretdaniel reindeerefficientindexingofkmerpresenceandabundanceinsequencingdatasets AT salsonmikael reindeerefficientindexingofkmerpresenceandabundanceinsequencingdatasets AT chikhirayan reindeerefficientindexingofkmerpresenceandabundanceinsequencingdatasets


Similar Items