AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models
Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. How...
Main authors: | Silva, Milton; Pratas, Diogo; Pinho, Armando J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8146440/ https://www.ncbi.nlm.nih.gov/pubmed/33925812 http://dx.doi.org/10.3390/e23050530 |
_version_ | 1783697397471248384 |
---|---|
author | Silva, Milton; Pratas, Diogo; Pinho, Armando J. |
author_facet | Silva, Milton; Pratas, Diogo; Pinho, Armando J. |
author_sort | Silva, Milton |
collection | PubMed |
description | Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors only marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and applies individual cache-hash memory models to the highest context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of computations that are three times slower. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences’ input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license. |
format | Online Article Text |
id | pubmed-8146440 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-8146440 2021-05-26 AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models Silva, Milton; Pratas, Diogo; Pinho, Armando J. Entropy (Basel) Article MDPI 2021-04-26 /pmc/articles/PMC8146440/ /pubmed/33925812 http://dx.doi.org/10.3390/e23050530 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Silva, Milton Pratas, Diogo Pinho, Armando J. AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models |
title | AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8146440/ https://www.ncbi.nlm.nih.gov/pubmed/33925812 http://dx.doi.org/10.3390/e23050530 |