
A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models


Bibliographic Details
Main authors: Pratas, Diogo; Hosseini, Morteza; Silva, Jorge M.; Pinho, Armando J.
Format: Online Article Text
Language: English
Published: MDPI 2019
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514418/
http://dx.doi.org/10.3390/e21111074
author Pratas, Diogo
Hosseini, Morteza
Silva, Jorge M.
Pinho, Armando J.
collection PubMed
description The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitutional tolerant context models) and weighted stochastic repeat models. Both classes of models use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available, under the GPLv3 license.
format Online
Article
Text
id pubmed-7514418
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7514418 2020-11-09 Entropy (Basel) Article MDPI 2019-11-02 /pmc/articles/PMC7514418/ http://dx.doi.org/10.3390/e21111074 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models
topic Article
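The description mentions context models that handle inverted repeats before arithmetic encoding. As a rough illustration only, the sketch below shows a single adaptive order-k context model over the alphabet ACGT that can also update its counts from the reverse-complement strand, and reports the ideal code length an arithmetic coder would approach. It is not the authors' implementation: the function names, the smoothing parameter `alpha`, and the choice to leave the first k symbols uncoded are all assumptions made for this example, and the model weighting, the stochastic repeat models, and the competitive selection between classes are omitted.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over ACGT."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def estimate_bits(seq, k=2, alpha=1.0, inverted_repeats=True):
    """Ideal code length (bits) of seq[k:] under an adaptive order-k model.

    Hypothetical helper for illustration: the first k symbols are not coded,
    and symbols outside ACGT are not supported.
    """
    # counts[context][symbol] = how often symbol followed context so far
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    total_bits = 0.0
    for i in range(k, len(seq)):
        ctx, sym = seq[i - k:i], seq[i]
        c = counts[ctx]
        # Additive-smoothing estimate of P(sym | ctx) from past counts only
        p = (c[sym] + alpha) / (sum(c.values()) + alpha * len(ALPHABET))
        total_bits += -math.log2(p)
        c[sym] += 1
        if inverted_repeats:
            # The same (k+1)-window read on the opposite strand: its context
            # is the reverse complement of the last k symbols, and its next
            # symbol is the complement of the window's first symbol.
            ir_ctx = reverse_complement(seq[i - k + 1:i + 1])
            ir_sym = COMPLEMENT[seq[i - k]]
            counts[ir_ctx][ir_sym] += 1
    return total_bits


if __name__ == "__main__":
    dna = "ACGTTGCAACGTTGCA"
    for use_ir in (False, True):
        bits = estimate_bits(dna, k=2, inverted_repeats=use_ir)
        print(f"inverted_repeats={use_ir}: {bits / (len(dna) - 2):.3f} bits per coded base")
```

Because the model is updated only after each symbol is coded, a decoder holding the same counts could reproduce every probability, which is what keeps such a scheme lossless.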