Predicting host taxonomic information from viral genomes: A comparison of feature representations

Bibliographic Details
Main Authors: Young, Francesca; Rogers, Simon; Robertson, David L.
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7307784/
https://www.ncbi.nlm.nih.gov/pubmed/32453718
http://dx.doi.org/10.1371/journal.pcbi.1007894
author Young, Francesca
Rogers, Simon
Robertson, David L.
collection PubMed
description The rise in metagenomics has led to an exponential growth in virus discovery. However, the majority of these new virus sequences have no assigned host. Current machine learning approaches to predicting virus-host interactions tend to focus on nucleotide features, ignoring other representations of genomic information. Here we investigate the predictive potential of features generated from four different ‘levels’ of viral genome representation: nucleotide, amino acid, amino acid properties and protein domains. This more fully exploits the biological information present in the virus genomes. Over 180 binary datasets for infecting versus non-infecting viruses at all taxonomic ranks of both eukaryote and prokaryote hosts were compiled. The viral genomes were converted into the four different levels of genome representation, and twenty feature sets were generated by extracting k-mer compositions and predicted protein domains. We trained and tested Support Vector Machine (SVM) classifiers to compare the predictive capacity of each of these feature sets for each dataset. Our results show that all levels of genome representation are consistently predictive of host taxonomy and that prediction from k-mer composition improves with increasing k-mer length for all k-mer-based features. Using a phylogenetically aware holdout method, we demonstrate that the predictive feature sets contain signals reflecting both the evolutionary relationships between viruses infecting related hosts and host mimicry. Our results demonstrate that incorporating a range of complementary features, generated purely from virus genome sequences, leads to improved accuracy for a range of virus-host prediction tasks, enabling computational assignment of host taxonomic information.
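As an illustration of the k-mer composition features mentioned in the description, the following is a minimal sketch, not the authors' pipeline. The function name `kmer_composition`, the example sequence, and the choice to normalize counts to frequencies are all assumptions made for this example. It covers only the nucleotide level; the same idea would apply to an amino acid alphabet by passing a different `alphabet`.

```python
from collections import Counter
from itertools import product


def kmer_composition(sequence, k, alphabet="ACGT"):
    """Return normalized k-mer frequencies over a fixed alphabet.

    Hypothetical helper for illustration only. Windows containing
    characters outside the alphabet (e.g. N) are ignored.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate all possible k-mers so every genome yields the same feature order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


# Toy sequence, not real viral data.
features = kmer_composition("ATGCGATATGC", k=2)
```

The resulting fixed-length vectors (16 values for k=2 over four nucleotides) are the kind of input a classifier such as an SVM could be trained on. Longer k gives a larger feature space, with 4^k possible nucleotide k-mers.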
format Online
Article
Text
id pubmed-7307784
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7307784 2020-06-24 Predicting host taxonomic information from viral genomes: A comparison of feature representations Young, Francesca Rogers, Simon Robertson, David L. PLoS Comput Biol Research Article Public Library of Science 2020-05-26 /pmc/articles/PMC7307784/ /pubmed/32453718 http://dx.doi.org/10.1371/journal.pcbi.1007894 Text en © 2020 Young et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Predicting host taxonomic information from viral genomes: A comparison of feature representations
topic Research Article