Lessons Learnt From Using the Machine Learning Random Forest Algorithm to Predict Virulence in Streptococcus pyogenes

Group A Streptococcus (GAS) is a globally significant human pathogen. The extensive variability of the GAS genome, virulence phenotypes and clinical outcomes renders it an excellent candidate for the application of genotype-phenotype association studies in the era of whole-genome sequencing. We have catal...

Bibliographic Details
Main authors: Buckley, Sean J., Harvey, Robert J.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8739889/
https://www.ncbi.nlm.nih.gov/pubmed/35004362
http://dx.doi.org/10.3389/fcimb.2021.809560
author Buckley, Sean J.
Harvey, Robert J.
collection PubMed
description Group A Streptococcus (GAS) is a globally significant human pathogen. The extensive variability of the GAS genome, virulence phenotypes and clinical outcomes renders it an excellent candidate for the application of genotype-phenotype association studies in the era of whole-genome sequencing. We have catalogued the distribution and diversity of the transcription regulators of GAS, and employed phylogenetics, concordance metrics and machine learning (ML) to test for associations. In this review, we communicate the lessons learnt in the context of recent bacterial genotype-phenotype association studies by others that have utilised both genome-wide association studies (GWAS) and ML. We envisage a promising future for the application of GWAS in bacterial genotype-phenotype association studies and foresee the increasing use of ML. However, progress in this field is hindered by several outstanding bottlenecks. These include the shortcomings observed when GWAS techniques that have been fine-tuned on human genomes are applied to bacterial genomes. Furthermore, there is a deficit of easy-to-use end-to-end workflows, and a lag in the collection of detailed phenotype and clinical genomic metadata. We propose a novel quality control protocol for the collection of high-quality GAS virulence phenotype data coupled to clinical outcome data. Finally, we incorporate this protocol into a workflow for testing genotype-phenotype associations using ML and ‘linked’ patient-microbe genome sets that better represent the infection event.
format Online
Article
Text
id pubmed-8739889
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8739889 2022-01-08 Lessons Learnt From Using the Machine Learning Random Forest Algorithm to Predict Virulence in Streptococcus pyogenes Buckley, Sean J. Harvey, Robert J. Front Cell Infect Microbiol Cellular and Infection Microbiology Frontiers Media S.A.
2021-12-24 /pmc/articles/PMC8739889/ /pubmed/35004362 http://dx.doi.org/10.3389/fcimb.2021.809560 Text en Copyright © 2021 Buckley and Harvey https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Lessons Learnt From Using the Machine Learning Random Forest Algorithm to Predict Virulence in Streptococcus pyogenes
topic Cellular and Infection Microbiology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8739889/
https://www.ncbi.nlm.nih.gov/pubmed/35004362
http://dx.doi.org/10.3389/fcimb.2021.809560