
Optimal compressed representation of high throughput sequence data via light assembly

The most effective genomic data compression methods either assemble reads into contigs, or replace them with their alignment positions on a reference genome. Such methods require significant computational resources, but faster alternatives that avoid using explicit or de novo-constructed references...


Bibliographic Details
Main Authors: Ginart, Antonio A., Hui, Joseph, Zhu, Kaiyuan, Numanagić, Ibrahim, Courtade, Thomas A., Sahinalp, S. Cenk, Tse, David N.
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2018
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5805770/
https://www.ncbi.nlm.nih.gov/pubmed/29422526
http://dx.doi.org/10.1038/s41467-017-02480-6
author Ginart, Antonio A.
Hui, Joseph
Zhu, Kaiyuan
Numanagić, Ibrahim
Courtade, Thomas A.
Sahinalp, S. Cenk
Tse, David N.
collection PubMed
description The most effective genomic data compression methods either assemble reads into contigs or replace them with their alignment positions on a reference genome. Such methods require significant computational resources, but faster alternatives that avoid using explicit or de novo-constructed references fail to match their performance. Here, we introduce a new reference-free compressed representation for genomic data based on light de novo assembly of reads, where each read is represented as a node in a (compact) trie. We show how to efficiently build such tries to compactly represent reads and demonstrate that among all methods using this representation (including all de novo assembly-based methods), our method achieves the shortest possible output. We also provide a lower bound on the compression rate achievable on uniformly sampled genomic read data, which our method closely approximates. Our method significantly improves the compression performance of alternatives without compromising speed.
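The idea in the description, storing each read as a node that hangs off another read it overlaps, can be illustrated with a small sketch. This is not the authors' algorithm: it assumes error-free reads and exact suffix-prefix overlaps, picks parents greedily among earlier reads only, and every name in it (`Node`, `overlap_shift`, `encode`, `decode`, `min_len`) is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    parent: Optional[int]  # index of the parent read, or None for a root
    shift: int             # where this read starts inside its parent
    tail: str              # bases of this read that the parent does not cover


def overlap_shift(a: str, b: str, min_len: int) -> int:
    """Smallest s such that a[s:] is a prefix of b with at least
    min_len overlapping bases; -1 if no such shift exists."""
    for s in range(0, len(a) - min_len + 1):
        if b.startswith(a[s:]):
            return s
    return -1


def encode(reads: List[str], min_len: int = 3) -> List[Node]:
    """Greedy sketch: each read may only choose an earlier read as parent,
    so the parent links can never form a cycle."""
    nodes: List[Node] = []
    for i, read in enumerate(reads):
        best = None  # (parent index, shift) with the largest overlap so far
        for j in range(i):
            s = overlap_shift(reads[j], read, min_len)
            if s != -1 and (best is None or s < best[1]):
                best = (j, s)
        if best is None:
            nodes.append(Node(None, 0, read))  # root: stored in full
        else:
            j, s = best
            nodes.append(Node(j, s, read[len(reads[j]) - s:]))
    return nodes


def decode(nodes: List[Node]) -> List[str]:
    """Rebuild reads in order; parents always precede their children."""
    reads: List[str] = []
    for n in nodes:
        if n.parent is None:
            reads.append(n.tail)
        else:
            reads.append(reads[n.parent][n.shift:] + n.tail)
    return reads


reads = ["ACGTAC", "GTACGG", "TTTTTT", "ACGGAT"]
nodes = encode(reads)
stored_bases = sum(len(n.tail) for n in nodes)
```

In this toy input the second and fourth reads each store only two bases plus a parent index and a shift, which is where the saving comes from. Handling sequencing errors, choosing parents optimally, and entropy-coding the result are outside the scope of the sketch.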
format Online
Article
Text
id pubmed-5805770
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-5805770 2018-02-12 Nat Commun Article Nature Publishing Group UK 2018-02-08 /pmc/articles/PMC5805770/ /pubmed/29422526 http://dx.doi.org/10.1038/s41467-017-02480-6 Text en
© The Author(s) 2018 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material.
If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title Optimal compressed representation of high throughput sequence data via light assembly
topic Article