Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning

Bibliographic Details
Main authors: Brierley, Liam; Fowler, Anna
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8087038/
https://www.ncbi.nlm.nih.gov/pubmed/33878118
http://dx.doi.org/10.1371/journal.ppat.1009149
_version_ 1783686606242185216
author Brierley, Liam
Fowler, Anna
collection PubMed
description The COVID-19 pandemic has demonstrated the serious potential for novel zoonotic coronaviruses to emerge and cause major outbreaks. The immediate animal origin of the causative virus, SARS-CoV-2, remains unknown, a notoriously challenging task for emerging disease investigations. Coevolution with hosts leads to specific evolutionary signatures within viral genomes that can inform likely animal origins. We obtained a set of 650 spike protein and 511 whole genome nucleotide sequences from 222 and 185 viruses belonging to the family Coronaviridae, respectively. We then trained random forest models independently on genome composition biases of spike protein and whole genome sequences, including dinucleotide and codon usage biases in order to predict animal host (of nine possible categories, including human). In hold-one-out cross-validation, predictive accuracy on unseen coronaviruses consistently reached ~73%, indicating evolutionary signal in spike proteins to be just as informative as whole genome sequences. However, different composition biases were informative in each case. Applying optimised random forest models to classify human sequences of MERS-CoV and SARS-CoV revealed evolutionary signatures consistent with their recognised intermediate hosts (camelids, carnivores), while human sequences of SARS-CoV-2 were predicted as having bat hosts (suborder Yinpterochiroptera), supporting bats as the suspected origins of the current pandemic. In addition to phylogeny, variation in genome composition can act as an informative approach to predict emerging virus traits as soon as sequences are available. More widely, this work demonstrates the potential in combining genetic resources with machine learning algorithms to address long-standing challenges in emerging infectious diseases.
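The description above names dinucleotide usage bias as one of the genome composition features given to the random forest models. The record does not state how the authors computed it, so the following is only a minimal sketch of one common formulation: the observed frequency of each dinucleotide divided by the frequency expected from its two component bases. The function name `dinucleotide_bias` and the example sequence are hypothetical and are not taken from the article.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"


def dinucleotide_bias(seq):
    """Return observed/expected ratios for the 16 dinucleotides.

    Assumption (not from the article): expected frequency is the product
    of the two single-base frequencies measured on the same sequence.
    """
    # Treat RNA and DNA alphabets alike.
    seq = seq.upper().replace("U", "T")

    # Count single bases, ignoring ambiguity codes such as N.
    mono = Counter(base for base in seq if base in NUCLEOTIDES)

    # Count adjacent pairs only when both positions are unambiguous,
    # so that skipping an N never creates an artificial pair.
    di = Counter(
        a + b
        for a, b in zip(seq, seq[1:])
        if a in NUCLEOTIDES and b in NUCLEOTIDES
    )

    n_mono = sum(mono.values())
    n_di = sum(di.values())

    bias = {}
    for a, b in product(NUCLEOTIDES, repeat=2):
        expected = (mono[a] / n_mono) * (mono[b] / n_mono) if n_mono else 0.0
        observed = di[a + b] / n_di if n_di else 0.0
        # The ratio is undefined when a component base never occurs.
        bias[a + b] = observed / expected if expected > 0 else float("nan")
    return bias


if __name__ == "__main__":
    # Hypothetical toy sequence, not real coronavirus data.
    features = dinucleotide_bias("ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTC")
    for pair in sorted(features):
        print(pair, round(features[pair], 3))
```

A ratio near 1 means the dinucleotide occurs about as often as its base composition predicts, while values well below or above 1 indicate under- or over-representation. A vector of 16 such ratios per sequence is the kind of feature row a classifier could be trained on, alongside codon usage measures.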
format Online
Article
Text
id pubmed-8087038
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8087038 2021-05-06 Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning Brierley, Liam Fowler, Anna PLoS Pathog Research Article Public Library of Science 2021-04-20 /pmc/articles/PMC8087038/ /pubmed/33878118 http://dx.doi.org/10.1371/journal.ppat.1009149 Text en © 2021 Brierley, Fowler https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8087038/
https://www.ncbi.nlm.nih.gov/pubmed/33878118
http://dx.doi.org/10.1371/journal.ppat.1009149