
Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

BACKGROUND: Data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for methods more efficient than general-purpose compression tools, such as the widely used gzip. RESULTS: We present a nov...


Bibliographic Details
Main Authors: Benoit, Gaëtan, Lemaitre, Claire, Lavenier, Dominique, Drezen, Erwan, Dayris, Thibault, Uricaru, Raluca, Rizk, Guillaume
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4570262/
https://www.ncbi.nlm.nih.gov/pubmed/26370285
http://dx.doi.org/10.1186/s12859-015-0709-7
_version_ 1782390174425022464
author Benoit, Gaëtan
Lemaitre, Claire
Lavenier, Dominique
Drezen, Erwan
Dayris, Thibault
Uricaru, Raluca
Rizk, Guillaume
collection PubMed
description BACKGROUND: Data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for methods more efficient than general-purpose compression tools, such as the widely used gzip. RESULTS: We present a novel reference-free method for compressing data produced by high-throughput sequencing technologies. Our approach, implemented in the software Leon, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph by storing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing information pertinent to downstream analyses. CONCLUSIONS: Leon was run on various real sequencing datasets (whole genome, exome, RNA-seq, and metagenomics). In all cases, Leon showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole-genome sequencing dataset, Leon reduced the original file size by a factor of more than 20. Leon is open source software, distributed under the GNU Affero GPL license, and available for download at http://gatb.inria.fr/software/leon/. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0709-7) contains supplementary material, which is available to authorized users.
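The encoding idea in the description (a de Bruijn graph held in a Bloom filter, with each read stored as an anchoring k-mer plus a list of bifurcations) can be illustrated with a minimal Python sketch. This is not Leon's implementation. It assumes, for simplicity, that every k-mer of every read is inserted into the filter, that the anchor is always the first k-mer of the read, and that reverse complements are ignored. All names (`BloomFilter`, `build_graph`, `encode_read`, `decode_read`) and the filter parameters are hypothetical.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over strings (sizes are illustrative assumptions)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))


def build_graph(reads, k):
    """Insert every k-mer of every read; graph edges stay implicit."""
    bloom = BloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])
    return bloom


def successors(bloom, kmer):
    """Bases whose extension of kmer is (probably) present in the graph."""
    return [b for b in "ACGT" if kmer[1:] + b in bloom]


def encode_read(read, bloom, k):
    """Return (anchor k-mer, read length, bases chosen at bifurcations)."""
    anchor = read[:k]
    bifurcations = []
    kmer = anchor
    for base in read[k:]:
        # A base is stored only where the graph does not offer a unique path.
        if len(successors(bloom, kmer)) != 1:
            bifurcations.append(base)
        kmer = kmer[1:] + base
    return anchor, len(read), bifurcations


def decode_read(anchor, length, bifurcations, bloom):
    """Rebuild the read by walking the graph from the anchor."""
    read = anchor
    kmer = anchor
    choices = iter(bifurcations)
    while len(read) < length:
        nxt = successors(bloom, kmer)
        base = nxt[0] if len(nxt) == 1 else next(choices)
        read += base
        kmer = kmer[1:] + base
    return read
```

Because the encoder and decoder query the same filter, Bloom filter false positives only add extra bifurcations in this sketch and do not break the round trip. For example, `decode_read(*encode_read(r, bloom, k), bloom)` should return `r` for any read `r` that was used to build `bloom`.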
format Online
Article
Text
id pubmed-4570262
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4570262 2015-09-16 Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph Benoit, Gaëtan Lemaitre, Claire Lavenier, Dominique Drezen, Erwan Dayris, Thibault Uricaru, Raluca Rizk, Guillaume BMC Bioinformatics Research Article
BioMed Central 2015-09-14 /pmc/articles/PMC4570262/ /pubmed/26370285 http://dx.doi.org/10.1186/s12859-015-0709-7 Text en © Benoit et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph
topic Research Article