GDC 2: Compression of large collections of genomes
The falling price of high-throughput genome sequencing is changing the landscape of modern genomics. A number of large-scale projects aimed at sequencing many human genomes are in progress. Genome sequencing is also becoming an important aid in personalized medicine. One of the significant side eff...
Main authors: | Deorowicz, Sebastian; Danek, Agnieszka; Niemiec, Marcin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group, 2015 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479802/ https://www.ncbi.nlm.nih.gov/pubmed/26108279 http://dx.doi.org/10.1038/srep11565 |
_version_ | 1782378063176138752 |
---|---|
author | Deorowicz, Sebastian; Danek, Agnieszka; Niemiec, Marcin |
author_facet | Deorowicz, Sebastian; Danek, Agnieszka; Niemiec, Marcin |
author_sort | Deorowicz, Sebastian |
collection | PubMed |
description | The falling price of high-throughput genome sequencing is changing the landscape of modern genomics. A number of large-scale projects aimed at sequencing many human genomes are in progress. Genome sequencing is also becoming an important aid in personalized medicine. One significant side effect of this change is the need to store and transfer huge amounts of genomic data. In this paper we deal with the problem of compressing large collections of complete genomic sequences. We propose an algorithm that is able to compress a collection of 1092 human diploid genomes about 9,500 times. This result is about 4 times better than that offered by other existing compressors. Moreover, our algorithm is very fast, processing the data at 200 MB/s on a modern workstation. As a consequence, the proposed algorithm allows complete genomic collections to be stored at low cost; e.g., the examined collection of 1092 human genomes needs only about 700 MB when compressed, compared with about 6.7 TB of uncompressed FASTA files. The source code is available at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=gdc&subpage=about. |
format | Online Article Text |
id | pubmed-4479802 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | Nature Publishing Group |
record_format | MEDLINE/PubMed |
spelling | pubmed-4479802 2015-06-29 GDC 2: Compression of large collections of genomes. Deorowicz, Sebastian; Danek, Agnieszka; Niemiec, Marcin. Sci Rep. Article. Nature Publishing Group 2015-06-25. /pmc/articles/PMC4479802/ /pubmed/26108279 http://dx.doi.org/10.1038/srep11565 Text en. Copyright © 2015, Macmillan Publishers Limited. http://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ |
spellingShingle | Article Deorowicz, Sebastian Danek, Agnieszka Niemiec, Marcin GDC 2: Compression of large collections of genomes |
title | GDC 2: Compression of large collections of genomes |
title_full | GDC 2: Compression of large collections of genomes |
title_fullStr | GDC 2: Compression of large collections of genomes |
title_full_unstemmed | GDC 2: Compression of large collections of genomes |
title_short | GDC 2: Compression of large collections of genomes |
title_sort | gdc 2: compression of large collections of genomes |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479802/ https://www.ncbi.nlm.nih.gov/pubmed/26108279 http://dx.doi.org/10.1038/srep11565 |
work_keys_str_mv | AT deorowiczsebastian gdc2compressionoflargecollectionsofgenomes AT danekagnieszka gdc2compressionoflargecollectionsofgenomes AT niemiecmarcin gdc2compressionoflargecollectionsofgenomes |
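
The figures quoted in the description field can be cross-checked with a little arithmetic. The sketch below is not part of the record or of GDC 2 itself; it only recomputes the compression ratio, the average compressed size per genome, and a rough processing time from the numbers in the abstract. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and assumes that the quoted 200 MB/s refers to the uncompressed input; neither assumption is stated in the record, and all names are illustrative.

```python
# Sanity check of the figures quoted in the abstract.
# Assumptions (not stated in the record): decimal units, and the
# 200 MB/s speed is measured against the uncompressed input.

UNCOMPRESSED_BYTES = 6.7e12      # "about 6.7 TB of uncompressed FASTA files"
COMPRESSED_BYTES = 700e6         # "about 700 MB when compressed"
GENOME_COUNT = 1092              # "1092 human diploid genomes"
THROUGHPUT_BYTES_PER_S = 200e6   # "200 MB/s on a modern workstation"


def compression_ratio(uncompressed: float, compressed: float) -> float:
    """Return how many times smaller the compressed data is."""
    return uncompressed / compressed


def per_genome_megabytes(compressed: float, genomes: int) -> float:
    """Return the average compressed size per genome, in MB."""
    return compressed / genomes / 1e6


def processing_hours(uncompressed: float, throughput: float) -> float:
    """Return the time needed to process the input at a given speed, in hours."""
    return uncompressed / throughput / 3600


if __name__ == "__main__":
    ratio = compression_ratio(UNCOMPRESSED_BYTES, COMPRESSED_BYTES)
    per_genome = per_genome_megabytes(COMPRESSED_BYTES, GENOME_COUNT)
    hours = processing_hours(UNCOMPRESSED_BYTES, THROUGHPUT_BYTES_PER_S)
    print(f"compression ratio: about {ratio:,.0f}x")
    print(f"compressed size per genome: about {per_genome:.2f} MB")
    print(f"time to process the collection: about {hours:.1f} h")
```

Under these assumptions the ratio comes out slightly above 9,500, consistent with the "about 9,500 times" in the abstract, and each genome averages well under 1 MB in the compressed collection. Both input figures are rounded in the source, so the results are only approximate.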