
CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments

BACKGROUND: High-throughput experimental technologies are generating tremendous amounts of genomic data, offering valuable resources to answer important questions and extract biological insights. Storing this sheer amount of genomic data has become a major concern in bioinformatics. General purpose...


Bibliographic Details
Main authors: Rahman, Md Ashiqur; Tutul, Abdullah Aman; Abdullah, Sifat Muhammad; Bayzid, Md. Shamsuzzoha
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9015123/
https://www.ncbi.nlm.nih.gov/pubmed/35436292
http://dx.doi.org/10.1371/journal.pone.0265360
_version_ 1784688320636256256
author Rahman, Md Ashiqur
Tutul, Abdullah Aman
Abdullah, Sifat Muhammad
Bayzid, Md. Shamsuzzoha
author_facet Rahman, Md Ashiqur
Tutul, Abdullah Aman
Abdullah, Sifat Muhammad
Bayzid, Md. Shamsuzzoha
author_sort Rahman, Md Ashiqur
collection PubMed
description BACKGROUND: High-throughput experimental technologies are generating tremendous amounts of genomic data, offering valuable resources to answer important questions and extract biological insights. Storing this volume of genomic data has become a major concern in bioinformatics. General-purpose compression techniques (e.g., gzip, bzip2, 7-zip) are widely used because of their pervasiveness and relatively good speed. However, they are not customized for genomic data and may fail to leverage the special characteristics and redundancy of biomolecular sequences. RESULTS: We present a new lossless compression method, CHAPAO (COmpressing Alignments using Hierarchical and Probabilistic Approach), which is specifically designed for multiple sequence alignments (MSAs) of biomolecular data and offers very good compression gain. We introduce a novel hierarchical referencing technique to represent biomolecular sequences, which combines likelihood-based analyses of sequence similarities with graph-theoretic algorithms. We performed an extensive evaluation using a collection of real biological data from the avian phylogenomics project, the 1000 plants project (1KP), and 16S and 23S rRNA datasets. We report the performance of CHAPAO in comparison with general-purpose compression techniques as well as with MFCompress and Nucleotide Archival Format (NAF), two of the best-known methods specifically designed for FASTA files. Experimental results suggest that CHAPAO offers significant improvements in compression gain over most alternative methods. CHAPAO is freely available as open-source software at https://github.com/ashiq24/CHAPAO. CONCLUSION: CHAPAO advances the state of the art in compression algorithms and represents a potential alternative to general-purpose compression techniques as well as to existing specialized compression techniques for biomolecular sequences.
format Online
Article
Text
id pubmed-9015123
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9015123 2022-04-19 CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments Rahman, Md Ashiqur; Tutul, Abdullah Aman; Abdullah, Sifat Muhammad; Bayzid, Md. Shamsuzzoha PLoS One Research Article Public Library of Science 2022-04-18 /pmc/articles/PMC9015123/ /pubmed/35436292 http://dx.doi.org/10.1371/journal.pone.0265360 Text en © 2022 Rahman et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Rahman, Md Ashiqur
Tutul, Abdullah Aman
Abdullah, Sifat Muhammad
Bayzid, Md. Shamsuzzoha
CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments
title CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments
title_sort chapao: likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9015123/
https://www.ncbi.nlm.nih.gov/pubmed/35436292
http://dx.doi.org/10.1371/journal.pone.0265360