GDC 2: Compression of large collections of genomes

The falling price of high-throughput genome sequencing is changing the landscape of modern genomics. A number of large-scale projects aimed at sequencing many human genomes are in progress. Genome sequencing is also becoming an important aid in personalized medicine. One of the significant side eff...

Bibliographic Details
Main Authors:	Deorowicz, Sebastian; Danek, Agnieszka; Niemiec, Marcin
Format:	Online Article Text
Language:	English
Published:	Nature Publishing Group 2015
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479802/ https://www.ncbi.nlm.nih.gov/pubmed/26108279 http://dx.doi.org/10.1038/srep11565

author	Deorowicz, Sebastian; Danek, Agnieszka; Niemiec, Marcin
collection	PubMed
description	The falling price of high-throughput genome sequencing is changing the landscape of modern genomics. A number of large-scale projects aimed at sequencing many human genomes are in progress. Genome sequencing is also becoming an important aid in personalized medicine. One significant side effect of this change is the need to store and transfer huge amounts of genomic data. In this paper we deal with the problem of compressing large collections of complete genomic sequences. We propose an algorithm that is able to compress a collection of 1092 human diploid genomes about 9,500 times. This result is about 4 times better than what is offered by other existing compressors. Moreover, our algorithm is very fast, processing the data at a speed of 200 MB/s on a modern workstation. As a consequence, the proposed algorithm allows complete genomic collections to be stored at low cost; e.g., the examined collection of 1092 human genomes needs only about 700 MB when compressed, compared with about 6.7 TB of uncompressed FASTA files. The source code is available at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=gdc&subpage=about.
format	Online Article Text
id	pubmed-4479802
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Nature Publishing Group
record_format	MEDLINE/PubMed
spelling	pubmed-4479802 2015-06-29 GDC 2: Compression of large collections of genomes. Deorowicz, Sebastian; Danek, Agnieszka; Niemiec, Marcin. Sci Rep. Article. Nature Publishing Group 2015-06-25 /pmc/articles/PMC4479802/ /pubmed/26108279 http://dx.doi.org/10.1038/srep11565 Text en Copyright © 2015, Macmillan Publishers Limited http://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material.
To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
title	GDC 2: Compression of large collections of genomes
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479802/ https://www.ncbi.nlm.nih.gov/pubmed/26108279 http://dx.doi.org/10.1038/srep11565
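
The figures quoted in the description field can be cross-checked with a few lines of arithmetic. The sketch below is not part of the record or of the GDC 2 software; it assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and assumes that the quoted 200 MB/s refers to uncompressed input. All names in it are hypothetical.

```python
# Rough consistency check of the numbers quoted in the abstract.
# Assumptions (not stated in the record): decimal units, and the
# 200 MB/s throughput is measured against the uncompressed input.

UNCOMPRESSED_BYTES = 6.7e12      # "about 6.7 TB of uncompressed FASTA files"
COMPRESSED_BYTES = 700e6         # "about 700 MB when compressed"
THROUGHPUT_BYTES_PER_S = 200e6   # "200 MB/s on a modern workstation"
GENOME_COUNT = 1092              # "1092 human diploid genomes"


def compression_ratio(raw_bytes: float, packed_bytes: float) -> float:
    """Return how many times smaller the packed data is than the raw data."""
    return raw_bytes / packed_bytes


def processing_hours(raw_bytes: float, bytes_per_second: float) -> float:
    """Return the time in hours needed to stream raw_bytes at the given rate."""
    return raw_bytes / bytes_per_second / 3600


ratio = compression_ratio(UNCOMPRESSED_BYTES, COMPRESSED_BYTES)
hours = processing_hours(UNCOMPRESSED_BYTES, THROUGHPUT_BYTES_PER_S)
compressed_mb_per_genome = COMPRESSED_BYTES / GENOME_COUNT / 1e6

print(f"compression ratio: about {ratio:,.0f}x")
print(f"time to process the collection: about {hours:.1f} h")
print(f"compressed size per genome: about {compressed_mb_per_genome:.2f} MB")
```

Under these assumptions the ratio comes out in the mid-9,000s, which agrees with the "about 9,500 times" stated in the abstract.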

Similar Items