
DNA-COMPACT: DNA COMpression Based on a Pattern-Aware Contextual Modeling Technique

Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA sequencing outstrips the rate of increase in disk storage capacity, the storage and transfer of large genome data are becoming important concerns for biomedical researchers. We propose a two-...


Bibliographic Details
Main authors: Li, Pinghao, Wang, Shuang, Kim, Jihoon, Xiong, Hongkai, Ohno-Machado, Lucila, Jiang, Xiaoqian
Format: Online Article Text
Language: English
Published: Public Library of Science 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3840021/
https://www.ncbi.nlm.nih.gov/pubmed/24282536
http://dx.doi.org/10.1371/journal.pone.0080377
author Li, Pinghao
Wang, Shuang
Kim, Jihoon
Xiong, Hongkai
Ohno-Machado, Lucila
Jiang, Xiaoqian
collection PubMed
description Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA sequencing outstrips the rate of increase in disk storage capacity, the storage and transfer of large genome data are becoming important concerns for biomedical researchers. We propose a two-pass lossless genome compression algorithm, which highlights the synthesis of complementary contextual models, to improve compression performance. The proposed framework handles genome compression with and without reference sequences, and demonstrated performance advantages over the best existing algorithms. The method for reference-free compression achieved bit rates of 1.720 and 1.838 bits per base for bacteria and yeast, approximately 3.7% and 2.6% better than the state-of-the-art algorithms. For compression with a reference, we tested on the first Korean personal genome sequence data set, and our proposed method demonstrated a 189-fold compression rate, reducing the raw file size from 2986.8 MB to 15.8 MB at a decompression cost comparable to that of existing algorithms. DNAcompact is freely available at https://sourceforge.net/projects/dnacompact/ for research purposes.
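The figures quoted in the description can be checked with a little arithmetic. The sketch below is only an illustration and is not part of DNA-COMPACT: the function names are hypothetical, and the 2 bits per base baseline is an assumption (the cost of a fixed-length code over the four-letter alphabet A, C, G, T), not a number taken from the record.

```python
import math


def compression_ratio(raw_mb: float, compressed_mb: float) -> float:
    """Return how many times smaller the compressed file is than the raw file."""
    return raw_mb / compressed_mb


def saving_vs_fixed_code(bits_per_base: float, alphabet_size: int = 4) -> float:
    """Fraction saved relative to a fixed-length code over the alphabet.

    Assumption for illustration: the baseline is log2(alphabet_size) bits
    per base, i.e. 2 bits for A, C, G, T.
    """
    baseline = math.log2(alphabet_size)
    return 1.0 - bits_per_base / baseline


# File sizes reported for the Korean personal genome data set.
ratio = compression_ratio(2986.8, 15.8)
print(f"reference-based: about {ratio:.0f}-fold")

# Reference-free bit rates reported for bacteria and yeast.
for name, rate in (("bacteria", 1.720), ("yeast", 1.838)):
    print(f"{name}: {saving_vs_fixed_code(rate):.1%} below the 2-bit baseline")
```

The ratio of the two reported file sizes agrees with the 189-fold figure in the description. The reference-free rates sit below the assumed 2-bit baseline, which is the sense in which they represent compression at all.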
format Online
Article
Text
id pubmed-3840021
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3840021 2013-11-26 DNA-COMPACT: DNA COMpression Based on a Pattern-Aware Contextual Modeling Technique Li, Pinghao Wang, Shuang Kim, Jihoon Xiong, Hongkai Ohno-Machado, Lucila Jiang, Xiaoqian PLoS One Research Article Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA sequencing outstrips the rate of increase in disk storage capacity, the storage and transfer of large genome data are becoming important concerns for biomedical researchers. We propose a two-pass lossless genome compression algorithm, which highlights the synthesis of complementary contextual models, to improve compression performance. The proposed framework handles genome compression with and without reference sequences, and demonstrated performance advantages over the best existing algorithms. The method for reference-free compression achieved bit rates of 1.720 and 1.838 bits per base for bacteria and yeast, approximately 3.7% and 2.6% better than the state-of-the-art algorithms. For compression with a reference, we tested on the first Korean personal genome sequence data set, and our proposed method demonstrated a 189-fold compression rate, reducing the raw file size from 2986.8 MB to 15.8 MB at a decompression cost comparable to that of existing algorithms. DNAcompact is freely available at https://sourceforge.net/projects/dnacompact/ for research purposes. Public Library of Science 2013-11-25 /pmc/articles/PMC3840021/ /pubmed/24282536 http://dx.doi.org/10.1371/journal.pone.0080377 Text en © 2013 Li et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title DNA-COMPACT: DNA COMpression Based on a Pattern-Aware Contextual Modeling Technique
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3840021/
https://www.ncbi.nlm.nih.gov/pubmed/24282536
http://dx.doi.org/10.1371/journal.pone.0080377