A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models
The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomed...
Main Authors: | Pratas, Diogo; Hosseini, Morteza; Silva, Jorge M.; Pinho, Armando J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514418/ http://dx.doi.org/10.3390/e21111074 |
_version_ | 1783586583594663936 |
---|---|
author | Pratas, Diogo; Hosseini, Morteza; Silva, Jorge M.; Pinho, Armando J. |
author_facet | Pratas, Diogo; Hosseini, Morteza; Silva, Jorge M.; Pinho, Armando J. |
author_sort | Pratas, Diogo |
collection | PubMed |
description | The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitutional tolerant context models) and weighted stochastic repeat models. Both classes of models use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available, under the GPLv3 license. |
format | Online Article Text |
id | pubmed-7514418 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-7514418 2020-11-09 A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models Pratas, Diogo; Hosseini, Morteza; Silva, Jorge M.; Pinho, Armando J. Entropy (Basel) Article The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitutional tolerant context models) and weighted stochastic repeat models. Both classes of models use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available, under the GPLv3 license. MDPI 2019-11-02 /pmc/articles/PMC7514418/ http://dx.doi.org/10.3390/e21111074 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
title | A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514418/ http://dx.doi.org/10.3390/e21111074 |
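
The description above says the method picks, for each symbol, the best of two model classes before arithmetic encoding. The following is a minimal sketch of that idea, not the authors' implementation. Everything in it is an assumption made for illustration: the class names `ContextModel` and `RepeatModel`, the function `estimate_bits`, the parameters `k_ctx`, `k_rep`, `alpha`, `confidence`, and `decay`, and the selection rule itself (choose the class with the lower decayed recent code length), which stands in for the paper's competitive prediction model. Model weighting, substitution tolerance, and inverted-repeat handling are left out, and the arithmetic coder is replaced by the ideal code length of -log2(p) bits per symbol.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextModel:
    """Order-k finite-context model with additive smoothing (illustrative, k >= 1)."""

    def __init__(self, k, alpha=1.0):
        self.k, self.alpha = k, alpha
        self.counts = defaultdict(lambda: [0, 0, 0, 0])

    def probs(self, history):
        c = self.counts[history[-self.k:]]
        total = sum(c) + 4 * self.alpha
        return [(n + self.alpha) / total for n in c]

    def update(self, history, symbol):
        self.counts[history[-self.k:]][ALPHABET.index(symbol)] += 1


class RepeatModel:
    """Predicts the symbol that followed the most recent copy of the current k-mer."""

    def __init__(self, k, confidence=0.9):
        self.k, self.confidence = k, confidence
        self.last_next = {}  # k-mer -> symbol that followed it last time

    def probs(self, history):
        guess = self.last_next.get(history[-self.k:])
        if guess is None:
            return [0.25] * 4  # no earlier copy known: fall back to uniform
        rest = (1.0 - self.confidence) / 3
        return [self.confidence if s == guess else rest for s in ALPHABET]

    def update(self, history, symbol):
        if len(history) >= self.k:
            self.last_next[history[-self.k:]] = symbol


def estimate_bits(seq, k_ctx=3, k_rep=8, decay=0.9):
    """Ideal code length (bits) when one model class is chosen per symbol."""
    models = [ContextModel(k_ctx), RepeatModel(k_rep)]
    recent_cost = [0.0, 0.0]  # decayed code length of each class so far
    k_max = max(k_ctx, k_rep)
    total_bits = 0.0
    for i, symbol in enumerate(seq):
        history = seq[max(0, i - k_max):i]
        idx = ALPHABET.index(symbol)
        p = [m.probs(history) for m in models]
        # The choice uses only past symbols, so a decoder could repeat it.
        chosen = 0 if recent_cost[0] <= recent_cost[1] else 1
        total_bits += -math.log2(p[chosen][idx])
        for j, m in enumerate(models):
            recent_cost[j] = decay * recent_cost[j] - math.log2(p[j][idx])
            m.update(history, symbol)
    return total_bits
```

Calling `estimate_bits("ACGTTGCA" * 50)` on a highly repetitive input should return well under the 2 bits per base that a fixed 2-bit encoding of A, C, G, T would need. Both classes are updated on every symbol regardless of which one was chosen, so the selector can switch classes as the local statistics change.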