
A guide to machine learning for bacterial host attribution using genome sequence data

With the ever-expanding number of available sequences from bacterial genomes, and the expectation that this data type will be the primary one generated by both diagnostic and research laboratories for the foreseeable future, there is both an opportunity and a need to evaluate how effectively...


Bibliographic Details
Main authors: Lupolova, Nadejda; Lycett, Samantha J.; Gally, David L.
Format: Online Article Text
Language: English
Published: Microbiology Society 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6939162/
https://www.ncbi.nlm.nih.gov/pubmed/31778355
http://dx.doi.org/10.1099/mgen.0.000317
author Lupolova, Nadejda
Lycett, Samantha J.
Gally, David L.
collection PubMed
description With the ever-expanding number of available sequences from bacterial genomes, and the expectation that this data type will be the primary one generated by both diagnostic and research laboratories for the foreseeable future, there is both an opportunity and a need to evaluate how effectively computational approaches can be used within bacterial genomics to predict and understand complex phenotypes, such as pathogenic potential and host source. This article applies various quantitative methods, such as diversity indexes, pangenome-wide association studies (GWAS) and dimensionality reduction techniques, to better understand the data, and then compares how well unsupervised and supervised machine learning (ML) methods can predict the source host of the isolates. The study uses the pangenomes of 1203 Salmonella enterica serovar Typhimurium isolates as an example, predicting 'host of isolation' with each of these methods. The article is intended as a review of recent applications of ML in infection biology, but working through this specific dataset also allows discussion of the advantages and drawbacks of the different techniques. As with all such sub-population studies, the biological relevance depends on the quality and diversity of the input data. Given this major caveat, we show that supervised ML has the potential to add real value to the interpretation of bacterial genomic data, as it can provide probabilistic outcomes for important phenotypes, something that is very difficult to achieve with the other methods.
format Online
Article
Text
id pubmed-6939162
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Microbiology Society
record_format MEDLINE/PubMed
spelling pubmed-6939162 2020-01-02 A guide to machine learning for bacterial host attribution using genome sequence data Lupolova, Nadejda Lycett, Samantha J. Gally, David L. Microb Genom Review
Microbiology Society 2019-11-28 /pmc/articles/PMC6939162/ /pubmed/31778355 http://dx.doi.org/10.1099/mgen.0.000317 Text en © 2019 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License.
title A guide to machine learning for bacterial host attribution using genome sequence data
topic Review
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6939162/
https://www.ncbi.nlm.nih.gov/pubmed/31778355
http://dx.doi.org/10.1099/mgen.0.000317