
GDC 2: Compression of large collections of genomes

The falling price of high-throughput genome sequencing is changing the landscape of modern genomics. A number of large-scale projects aimed at sequencing many human genomes are in progress. Genome sequencing is also becoming an important aid in personalized medicine. One of the significant side eff...


Bibliographic Details
Main authors: Deorowicz, Sebastian, Danek, Agnieszka, Niemiec, Marcin
Format: Online Article Text
Language: English
Published: Nature Publishing Group 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479802/
https://www.ncbi.nlm.nih.gov/pubmed/26108279
http://dx.doi.org/10.1038/srep11565
_version_ 1782378063176138752
author Deorowicz, Sebastian
Danek, Agnieszka
Niemiec, Marcin
author_facet Deorowicz, Sebastian
Danek, Agnieszka
Niemiec, Marcin
author_sort Deorowicz, Sebastian
collection PubMed
description The falling price of high-throughput genome sequencing is changing the landscape of modern genomics. A number of large-scale projects aimed at sequencing many human genomes are in progress. Genome sequencing is also becoming an important aid in personalized medicine. One significant side effect of this change is the need to store and transfer huge amounts of genomic data. In this paper we deal with the problem of compressing large collections of complete genomic sequences. We propose an algorithm that is able to compress a collection of 1092 human diploid genomes about 9,500 times. This result is about 4 times better than what the other existing compressors offer. Moreover, our algorithm is very fast, processing the data at 200 MB/s on a modern workstation. As a consequence, the proposed algorithm allows complete genomic collections to be stored at low cost; e.g., the examined collection of 1092 human genomes needs only about 700 MB when compressed, compared with about 6.7 TB of uncompressed FASTA files. The source code is available at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=gdc&subpage=about.
format Online
Article
Text
id pubmed-4479802
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Nature Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-4479802 2015-06-29 GDC 2: Compression of large collections of genomes Deorowicz, Sebastian Danek, Agnieszka Niemiec, Marcin Sci Rep Article The falling price of high-throughput genome sequencing is changing the landscape of modern genomics. A number of large-scale projects aimed at sequencing many human genomes are in progress. Genome sequencing is also becoming an important aid in personalized medicine. One significant side effect of this change is the need to store and transfer huge amounts of genomic data. In this paper we deal with the problem of compressing large collections of complete genomic sequences. We propose an algorithm that is able to compress a collection of 1092 human diploid genomes about 9,500 times. This result is about 4 times better than what the other existing compressors offer. Moreover, our algorithm is very fast, processing the data at 200 MB/s on a modern workstation. As a consequence, the proposed algorithm allows complete genomic collections to be stored at low cost; e.g., the examined collection of 1092 human genomes needs only about 700 MB when compressed, compared with about 6.7 TB of uncompressed FASTA files. The source code is available at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=gdc&subpage=about. Nature Publishing Group 2015-06-25 /pmc/articles/PMC4479802/ /pubmed/26108279 http://dx.doi.org/10.1038/srep11565 Text en Copyright © 2015, Macmillan Publishers Limited http://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material.
To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
spellingShingle Article
Deorowicz, Sebastian
Danek, Agnieszka
Niemiec, Marcin
GDC 2: Compression of large collections of genomes
title GDC 2: Compression of large collections of genomes
title_full GDC 2: Compression of large collections of genomes
title_fullStr GDC 2: Compression of large collections of genomes
title_full_unstemmed GDC 2: Compression of large collections of genomes
title_short GDC 2: Compression of large collections of genomes
title_sort gdc 2: compression of large collections of genomes
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479802/
https://www.ncbi.nlm.nih.gov/pubmed/26108279
http://dx.doi.org/10.1038/srep11565
work_keys_str_mv AT deorowiczsebastian gdc2compressionoflargecollectionsofgenomes
AT danekagnieszka gdc2compressionoflargecollectionsofgenomes
AT niemiecmarcin gdc2compressionoflargecollectionsofgenomes