A novel compression tool for efficient storage of genome resequencing data

With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genom...


Bibliographic Details
Main authors: Wang, Congmao; Zhang, Dabing
Format: Text
Language: English
Published: Oxford University Press 2011
Subjects: Methods Online
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074166/
https://www.ncbi.nlm.nih.gov/pubmed/21266471
http://dx.doi.org/10.1093/nar/gkr009
author Wang, Congmao
Zhang, Dabing
author_facet Wang, Congmao
Zhang, Dabing
author_sort Wang, Congmao
collection PubMed
description With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genome data, such as those of the human, mouse and rice, has become a challenge to biologists. Currently available bioinformatics tools used to compress genome sequence data have some limitations, such as the requirement of the reference single nucleotide polymorphisms (SNPs) map and information on deletions and insertions. Here, we present a novel compression tool for storing and analyzing Genome ReSequencing data, named GRS. GRS is able to process the genome sequence data without the use of the reference SNPs and other sequence variation information and automatically rebuild the individual genome sequence data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequence data set, GRS was able to achieve ∼159-fold compression, reducing the size of the data from 2986.8 to 18.8 MB. While being tested against the sequencing data from rice and Arabidopsis thaliana, GRS compressed the 361.0 MB rice genome data to 4.4 MB, and the A. thaliana genome data from 115.1 MB to 6.5 KB. This de novo compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.
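As a quick check on the figures quoted in the description, the fold-compression values can be recomputed from the reported file sizes. This is only an arithmetic sketch, not part of GRS: the function name `fold_compression` is hypothetical, and the conversion of 1 MB = 1024 KB is an assumption, since the record does not say which convention the authors used.

```python
# Arithmetic sketch only; names below are illustrative, not from GRS.
KB_PER_MB = 1024  # assumption: binary convention; the source does not specify


def fold_compression(original_mb: float, compressed_mb: float) -> float:
    """Return how many times smaller the compressed file is."""
    return original_mb / compressed_mb


# (original size in MB, compressed size in MB) as reported in the description
reported = {
    "Korean personal genome": (2986.8, 18.8),
    "rice": (361.0, 4.4),
    "A. thaliana": (115.1, 6.5 / KB_PER_MB),  # 6.5 KB expressed in MB
}

ratios = {name: fold_compression(o, c) for name, (o, c) in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: ~{ratio:.0f}-fold")
```

The first ratio agrees with the ∼159-fold figure stated for the Korean genome; the description gives no explicit fold values for the rice and A. thaliana data, so those two outputs are derived here rather than quoted.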
format Text
id pubmed-3074166
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3074166 2011-04-12 A novel compression tool for efficient storage of genome resequencing data Wang, Congmao Zhang, Dabing Nucleic Acids Res Methods Online Oxford University Press 2011-04 2011-01-25 /pmc/articles/PMC3074166/ /pubmed/21266471 http://dx.doi.org/10.1093/nar/gkr009 Text en © The Author(s) 2011. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/2.5 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methods Online
Wang, Congmao
Zhang, Dabing
A novel compression tool for efficient storage of genome resequencing data
title A novel compression tool for efficient storage of genome resequencing data
title_sort novel compression tool for efficient storage of genome resequencing data
topic Methods Online