A guide to machine learning for bacterial host attribution using genome sequence data
With the ever-expanding number of available sequences from bacterial genomes, and the expectation that this data type will be the primary one generated from both diagnostic and research laboratories for the foreseeable future, then there is both an opportunity and a need to evaluate how effectively...
Main authors: | Lupolova, Nadejda; Lycett, Samantha J.; Gally, David L. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Microbiology Society 2019 |
Subjects: | Review |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6939162/ https://www.ncbi.nlm.nih.gov/pubmed/31778355 http://dx.doi.org/10.1099/mgen.0.000317 |
_version_ | 1783484175971516416 |
---|---|
author | Lupolova, Nadejda Lycett, Samantha J. Gally, David L. |
author_facet | Lupolova, Nadejda Lycett, Samantha J. Gally, David L. |
author_sort | Lupolova, Nadejda |
collection | PubMed |
description | With the ever-expanding number of available sequences from bacterial genomes, and the expectation that this data type will be the primary one generated from both diagnostic and research laboratories for the foreseeable future, then there is both an opportunity and a need to evaluate how effectively computational approaches can be used within bacterial genomics to predict and understand complex phenotypes, such as pathogenic potential and host source. This article applied various quantitative methods such as diversity indexes, pangenome-wide association studies (GWAS) and dimensionality reduction techniques to better understand the data and then compared how well unsupervised and supervised machine learning (ML) methods could predict the source host of the isolates. The study uses the example of the pangenomes of 1203 Salmonella enterica serovar Typhimurium isolates in order to predict 'host of isolation' using these different methods. The article is aimed as a review of recent applications of ML in infection biology, but also, by working through this specific dataset, it allows discussion of the advantages and drawbacks of the different techniques. As with all such sub-population studies, the biological relevance will be dependent on the quality and diversity of the input data. Given this major caveat, we show that supervised ML has the potential to add real value to interpretation of bacterial genomic data, as it can provide probabilistic outcomes for important phenotypes, something that is very difficult to achieve with the other methods. |
format | Online Article Text |
id | pubmed-6939162 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Microbiology Society |
record_format | MEDLINE/PubMed |
spelling | pubmed-69391622020-01-02 A guide to machine learning for bacterial host attribution using genome sequence data Lupolova, Nadejda Lycett, Samantha J. Gally, David L. Microb Genom Review With the ever-expanding number of available sequences from bacterial genomes, and the expectation that this data type will be the primary one generated from both diagnostic and research laboratories for the foreseeable future, then there is both an opportunity and a need to evaluate how effectively computational approaches can be used within bacterial genomics to predict and understand complex phenotypes, such as pathogenic potential and host source. This article applied various quantitative methods such as diversity indexes, pangenome-wide association studies (GWAS) and dimensionality reduction techniques to better understand the data and then compared how well unsupervised and supervised machine learning (ML) methods could predict the source host of the isolates. The study uses the example of the pangenomes of 1203 Salmonella enterica serovar Typhimurium isolates in order to predict 'host of isolation' using these different methods. The article is aimed as a review of recent applications of ML in infection biology, but also, by working through this specific dataset, it allows discussion of the advantages and drawbacks of the different techniques. As with all such sub-population studies, the biological relevance will be dependent on the quality and diversity of the input data. Given this major caveat, we show that supervised ML has the potential to add real value to interpretation of bacterial genomic data, as it can provide probabilistic outcomes for important phenotypes, something that is very difficult to achieve with the other methods. 
Microbiology Society 2019-11-28 /pmc/articles/PMC6939162/ /pubmed/31778355 http://dx.doi.org/10.1099/mgen.0.000317 Text en © 2019 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License. |
spellingShingle | Review Lupolova, Nadejda Lycett, Samantha J. Gally, David L. A guide to machine learning for bacterial host attribution using genome sequence data |
title | A guide to machine learning for bacterial host attribution using genome sequence data |
title_full | A guide to machine learning for bacterial host attribution using genome sequence data |
title_fullStr | A guide to machine learning for bacterial host attribution using genome sequence data |
title_full_unstemmed | A guide to machine learning for bacterial host attribution using genome sequence data |
title_short | A guide to machine learning for bacterial host attribution using genome sequence data |
title_sort | guide to machine learning for bacterial host attribution using genome sequence data |
topic | Review |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6939162/ https://www.ncbi.nlm.nih.gov/pubmed/31778355 http://dx.doi.org/10.1099/mgen.0.000317 |