Compression-Complexity Measures for Analysis and Classification of Coronaviruses

Bibliographic Details
Main authors: Munagala, Naga Venkata Trinath Sai, Amanchi, Prem Kumar, Balasubramanian, Karthi, Panicker, Athira, Nagaraj, Nithin
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9857615/
https://www.ncbi.nlm.nih.gov/pubmed/36673224
http://dx.doi.org/10.3390/e25010081
_version_ 1784873908727447552
author Munagala, Naga Venkata Trinath Sai
Amanchi, Prem Kumar
Balasubramanian, Karthi
Panicker, Athira
Nagaraj, Nithin
collection PubMed
description Finding a vaccine or specific antiviral treatment for a global pandemic of viral disease (such as the ongoing COVID-19) requires rapid analysis, annotation and evaluation of metagenomic libraries to enable quick and efficient screening of nucleotide sequences. Traditional sequence alignment methods are not suitable, and there is a need for fast alignment-free techniques for sequence analysis. Information theory and data compression algorithms provide a rich set of mathematical and computational tools to capture essential patterns in biological sequences. In this study, we investigate the use of distance measures based on compression-complexity (Effort-to-Compress, or ETC, and Lempel-Ziv, or LZ, complexity) for analyzing genomic sequences. The proposed distance measure is used to successfully reproduce the phylogenetic trees for three datasets: a mammalian dataset consisting of eight species clusters; a set of coronaviruses belonging to group I, group II, group III, and SARS-CoV-1; and a set of coronaviruses comprising those causing COVID-19 (SARS-CoV-2) and those not causing COVID-19. Having demonstrated the usefulness of these compression-complexity measures, we employ them for the automatic classification of COVID-19-causing genome sequences using machine learning techniques. Two flavors of SVM (linear and quadratic), along with linear discriminant and fine k-nearest neighbors classifiers, are used for classification. Using a data set comprising 1001 coronavirus sequences (causing COVID-19 and those not causing COVID-19), a classification accuracy of 98% is achieved with a sensitivity of 95% and a specificity of 99.8%. This work could be extended further to enable medical practitioners to automatically identify and characterize coronavirus strains and their rapidly growing mutants in a fast and efficient fashion.
format Online
Article
Text
id pubmed-9857615
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9857615 2023-01-21 Entropy (Basel) Article MDPI 2022-12-31 /pmc/articles/PMC9857615/ /pubmed/36673224 http://dx.doi.org/10.3390/e25010081 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Compression-Complexity Measures for Analysis and Classification of Coronaviruses
topic Article
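
The abstract names two compression-complexity measures, Lempel-Ziv (LZ) complexity and Effort-to-Compress (ETC), and says distance measures are built on them. The following Python sketch illustrates the general idea only. It assumes the 1976 exhaustive-history definition of LZ complexity and the usual pair-substitution description of ETC (repeatedly replace the most frequent adjacent pair with a new symbol and count the passes until the sequence is constant). The function names, the tie-breaking rule, the overlapping pair count, and above all the distance formula in `complexity_distance` are assumptions made for illustration; this record does not give the formula used by the authors.

```python
from collections import Counter


def lz76_complexity(s: str) -> int:
    """Number of phrases in the Lempel-Ziv (1976) exhaustive parsing of s."""
    n = len(s)
    i = 0        # start index of the phrase being built
    phrases = 0
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from earlier text
        # (the copy source may overlap the phrase itself).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def etc_complexity(s: str) -> int:
    """Effort-to-Compress: pair-substitution passes until s is constant.

    Simplifications (assumptions): pairs are counted with overlap, and ties
    are broken in favour of the pair seen first.
    """
    codes = {ch: k for k, ch in enumerate(sorted(set(s)))}
    seq = [codes[ch] for ch in s]
    next_symbol = len(codes)   # fresh integer for each substituted pair
    steps = 0
    while len(seq) > 1 and len(set(seq)) > 1:
        pairs = Counter(zip(seq, seq[1:]))
        target = max(pairs, key=pairs.get)
        out, i = [], 0
        while i < len(seq):    # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == target:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
        steps += 1
    return steps


def complexity_distance(x: str, y: str, complexity=lz76_complexity) -> float:
    """Hypothetical normalized distance built from a complexity measure.

    Assumed form: max(C(xy) - C(x), C(yx) - C(y)) / max(C(x), C(y)).
    """
    cx, cy = complexity(x), complexity(y)
    if max(cx, cy) == 0:
        return 0.0
    return max(complexity(x + y) - cx, complexity(y + x) - cy) / max(cx, cy)
```

Under these assumptions, `lz76_complexity("0001101001000101")` returns 6 and `etc_complexity("11010010")` returns 5. A pairwise matrix of `complexity_distance` values over a set of genome strings is the kind of input a tree-building or clustering routine would take, though the pipeline used in the article may differ.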