Fast Phylogeny of SARS-CoV-2 by Compression

The compression method to assess similarity, in the sense of having a small normalized compression distance (NCD), was developed based on algorithmic information theory to quantify the similarity in files ranging from words and languages to genomes and music pieces. It has been validated on objects...
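
As a rough illustration of the distance named in the summary above, the sketch below computes the standard NCD formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length in bytes. Everything specific here is an assumption made for illustration and is not taken from this record: the use of Python's lzma module as the compressor, the helper names compressed_size and ncd, and the synthetic toy sequences, which are random strings rather than real viral genomes.

```python
import lzma
import random


def compressed_size(data: bytes) -> int:
    """Length in bytes of the lzma-compressed data (stand-in compressor, an assumption)."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both inputs
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic toy data (hypothetical, NOT real genomes), seeded for repeatability.
rng = random.Random(0)
seq_a = bytes(rng.choice(b"ACGT") for _ in range(2000))

# seq_b: a copy of seq_a with ten point substitutions (a "close relative").
mutated = bytearray(seq_a)
for pos in rng.sample(range(len(mutated)), 10):
    mutated[pos] = ord("C") if mutated[pos] == ord("A") else ord("A")
seq_b = bytes(mutated)

# seq_c: an independently generated random sequence (an "unrelated" object).
seq_c = bytes(rng.choice(b"ACGT") for _ in range(2000))

if __name__ == "__main__":
    print("NCD(a, b) =", ncd(seq_a, seq_b))
    print("NCD(a, c) =", ncd(seq_a, seq_c))
```

A smaller value means the compressor finds more shared structure, so the near-copy pair should score lower than the unrelated pair. Results depend on the compressor chosen, and this sketch makes no claim about which compressor or settings the authors used.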

Bibliographic Details
Main authors: Cilibrasi, Rudi L., Vitányi, Paul M. B.
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9030035/
https://www.ncbi.nlm.nih.gov/pubmed/35455102
http://dx.doi.org/10.3390/e24040439
_version_ 1784692045966737408
author Cilibrasi, Rudi L.
Vitányi, Paul M. B.
author_facet Cilibrasi, Rudi L.
Vitányi, Paul M. B.
author_sort Cilibrasi, Rudi L.
collection PubMed
description The compression method to assess similarity, in the sense of having a small normalized compression distance (NCD), was developed based on algorithmic information theory to quantify the similarity in files ranging from words and languages to genomes and music pieces. It has been validated on objects from different domains always using essentially the same software. We analyze the whole-genome phylogeny and taxonomy of the SARS-CoV-2 virus, which is responsible for causing the COVID-19 disease, using the alignment-free compression method to assess similarity. We compare the SARS-CoV-2 virus with a database of over 6500 viruses. The results suggest that the SARS-CoV-2 virus is closest in that database to the RaTG13 virus and rather close to the bat SARS-like coronaviruses bat-SL-CoVZXC21 and bat-SL-CoVZC45. Over 6500 viruses are identified (given by their registration code) with larger NCDs. The NCDs are compared with the NCDs between the mtDNA of familiar species. We address the question of whether pangolins are involved in the SARS-CoV-2 virus. The compression method is simpler and possibly faster than any other whole-genome method, which makes it the ideal tool to explore phylogeny. Here, we use it for the complex case of determining this similarity between the COVID-19 virus, SARS-CoV-2 and many other viruses. The resulting phylogeny and taxonomy closely resemble earlier results from alignment-based methods and a machine-learning method, providing the most compelling evidence to date for the compression method, showing that one can achieve equivalent results both simply and quickly.
format Online
Article
Text
id pubmed-9030035
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-90300352022-04-23 Fast Phylogeny of SARS-CoV-2 by Compression Cilibrasi, Rudi L. Vitányi, Paul M. B. Entropy (Basel) Article The compression method to assess similarity, in the sense of having a small normalized compression distance (NCD), was developed based on algorithmic information theory to quantify the similarity in files ranging from words and languages to genomes and music pieces. It has been validated on objects from different domains always using essentially the same software. We analyze the whole-genome phylogeny and taxonomy of the SARS-CoV-2 virus, which is responsible for causing the COVID-19 disease, using the alignment-free compression method to assess similarity. We compare the SARS-CoV-2 virus with a database of over 6500 viruses. The results suggest that the SARS-CoV-2 virus is closest in that database to the RaTG13 virus and rather close to the bat SARS-like coronaviruses bat-SL-CoVZXC21 and bat-SL-CoVZC45. Over 6500 viruses are identified (given by their registration code) with larger NCDs. The NCDs are compared with the NCDs between the mtDNA of familiar species. We address the question of whether pangolins are involved in the SARS-CoV-2 virus. The compression method is simpler and possibly faster than any other whole-genome method, which makes it the ideal tool to explore phylogeny. Here, we use it for the complex case of determining this similarity between the COVID-19 virus, SARS-CoV-2 and many other viruses. The resulting phylogeny and taxonomy closely resemble earlier results from alignment-based methods and a machine-learning method, providing the most compelling evidence to date for the compression method, showing that one can achieve equivalent results both simply and quickly. MDPI 2022-03-22 /pmc/articles/PMC9030035/ /pubmed/35455102 http://dx.doi.org/10.3390/e24040439 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Cilibrasi, Rudi L.
Vitányi, Paul M. B.
Fast Phylogeny of SARS-CoV-2 by Compression
title Fast Phylogeny of SARS-CoV-2 by Compression
title_full Fast Phylogeny of SARS-CoV-2 by Compression
title_fullStr Fast Phylogeny of SARS-CoV-2 by Compression
title_full_unstemmed Fast Phylogeny of SARS-CoV-2 by Compression
title_short Fast Phylogeny of SARS-CoV-2 by Compression
title_sort fast phylogeny of sars-cov-2 by compression
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9030035/
https://www.ncbi.nlm.nih.gov/pubmed/35455102
http://dx.doi.org/10.3390/e24040439
work_keys_str_mv AT cilibrasirudil fastphylogenyofsarscov2bycompression
AT vitanyipaulmb fastphylogenyofsarscov2bycompression