Predicting host taxonomic information from viral genomes: A comparison of feature representations
The rise in metagenomics has led to an exponential growth in virus discovery. However, the majority of these new virus sequences have no assigned host. Current machine learning approaches to predicting virus host interactions have a tendency to focus on nucleotide features, ignoring other representa...
Main Authors: | Young, Francesca; Rogers, Simon; Robertson, David L. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7307784/ https://www.ncbi.nlm.nih.gov/pubmed/32453718 http://dx.doi.org/10.1371/journal.pcbi.1007894 |
_version_ | 1783548870222938112 |
---|---|
author | Young, Francesca Rogers, Simon Robertson, David L. |
author_facet | Young, Francesca Rogers, Simon Robertson, David L. |
author_sort | Young, Francesca |
collection | PubMed |
description | The rise in metagenomics has led to an exponential growth in virus discovery. However, the majority of these new virus sequences have no assigned host. Current machine learning approaches to predicting virus-host interactions tend to focus on nucleotide features, ignoring other representations of genomic information. Here we investigate the predictive potential of features generated from four different ‘levels’ of viral genome representation: nucleotide, amino acid, amino acid properties and protein domains. This more fully exploits the biological information present in the virus genomes. Over a hundred and eighty binary datasets for infecting versus non-infecting viruses at all taxonomic ranks of both eukaryote and prokaryote hosts were compiled. The viral genomes were converted into the four different levels of genome representation and twenty feature sets were generated by extracting k-mer compositions and predicted protein domains. We trained and tested Support Vector Machine (SVM) classifiers to compare the predictive capacity of each of these feature sets for each dataset. Our results show that all levels of genome representation are consistently predictive of host taxonomy and that prediction from k-mer composition improves with increasing k-mer length for all k-mer based features. Using a phylogenetically aware holdout method, we demonstrate that the predictive feature sets contain signals reflecting both the evolutionary relationship between the viruses infecting related hosts, and host-mimicry. Our results demonstrate that incorporating a range of complementary features, generated purely from virus genome sequences, leads to improved accuracy for a range of virus-host prediction tasks, enabling computational assignment of host taxonomic information. |
format | Online Article Text |
id | pubmed-7307784 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-73077842020-06-24 Predicting host taxonomic information from viral genomes: A comparison of feature representations Young, Francesca Rogers, Simon Robertson, David L. PLoS Comput Biol Research Article The rise in metagenomics has led to an exponential growth in virus discovery. However, the majority of these new virus sequences have no assigned host. Current machine learning approaches to predicting virus-host interactions tend to focus on nucleotide features, ignoring other representations of genomic information. Here we investigate the predictive potential of features generated from four different ‘levels’ of viral genome representation: nucleotide, amino acid, amino acid properties and protein domains. This more fully exploits the biological information present in the virus genomes. Over a hundred and eighty binary datasets for infecting versus non-infecting viruses at all taxonomic ranks of both eukaryote and prokaryote hosts were compiled. The viral genomes were converted into the four different levels of genome representation and twenty feature sets were generated by extracting k-mer compositions and predicted protein domains. We trained and tested Support Vector Machine (SVM) classifiers to compare the predictive capacity of each of these feature sets for each dataset. Our results show that all levels of genome representation are consistently predictive of host taxonomy and that prediction from k-mer composition improves with increasing k-mer length for all k-mer based features. Using a phylogenetically aware holdout method, we demonstrate that the predictive feature sets contain signals reflecting both the evolutionary relationship between the viruses infecting related hosts, and host-mimicry. Our results demonstrate that incorporating a range of complementary features, generated purely from virus genome sequences, leads to improved accuracy for a range of virus-host prediction tasks, enabling computational assignment of host taxonomic information. Public Library of Science 2020-05-26 /pmc/articles/PMC7307784/ /pubmed/32453718 http://dx.doi.org/10.1371/journal.pcbi.1007894 Text en © 2020 Young et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Young, Francesca Rogers, Simon Robertson, David L. Predicting host taxonomic information from viral genomes: A comparison of feature representations |
title | Predicting host taxonomic information from viral genomes: A comparison of feature representations |
title_full | Predicting host taxonomic information from viral genomes: A comparison of feature representations |
title_fullStr | Predicting host taxonomic information from viral genomes: A comparison of feature representations |
title_full_unstemmed | Predicting host taxonomic information from viral genomes: A comparison of feature representations |
title_short | Predicting host taxonomic information from viral genomes: A comparison of feature representations |
title_sort | predicting host taxonomic information from viral genomes: a comparison of feature representations |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7307784/ https://www.ncbi.nlm.nih.gov/pubmed/32453718 http://dx.doi.org/10.1371/journal.pcbi.1007894 |
work_keys_str_mv | AT youngfrancesca predictinghosttaxonomicinformationfromviralgenomesacomparisonoffeaturerepresentations AT rogerssimon predictinghosttaxonomicinformationfromviralgenomesacomparisonoffeaturerepresentations AT robertsondavidl predictinghosttaxonomicinformationfromviralgenomesacomparisonoffeaturerepresentations |
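
The description above mentions feature sets built from k-mer compositions of viral genomes. As a rough illustration only, the sketch below computes a normalized k-mer composition for one sequence. It is not the authors' code: the function name `kmer_composition`, the choice to skip windows containing symbols outside the alphabet, and the normalization to relative frequencies are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_composition(sequence, k, alphabet="ACGT"):
    """Return relative k-mer frequencies for `sequence` over `alphabet`.

    Hypothetical helper for illustration. Windows containing any symbol
    outside the alphabet (for example 'N') are skipped, which is an
    assumption of this sketch, not a documented choice of the paper.
    """
    sequence = sequence.upper()
    valid = set(alphabet)

    # Count every overlapping window of length k made only of valid symbols.
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= valid
    )
    total = sum(counts.values())

    # Emit all |alphabet|^k possible k-mers in a fixed order so that
    # vectors from different genomes line up feature by feature.
    kmers = ("".join(p) for p in product(sorted(alphabet), repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


if __name__ == "__main__":
    features = kmer_composition("ACGTACGTNACG", k=2)
    for kmer, freq in features.items():
        if freq > 0:
            print(kmer, round(freq, 3))
```

The same function could be applied to a protein sequence by passing the twenty amino-acid letters as `alphabet`. The resulting fixed-length vectors are the kind of input a classifier such as an SVM would take, though the training setup described in the article is not reproduced here.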