
The advantage of intergenic regions as genomic features for machine-learning-based host attribution of Salmonella Typhimurium from the USA

Salmonella enterica is a taxonomically diverse pathogen with over 2600 serovars associated with a wide variety of animal hosts including humans, other mammals, birds and reptiles. Some serovars are host-specific or host-restricted and cause disease in distinct host species, while others, such as ser...


Bibliographic Details
Main Authors: Chalka, Antonia; Dallman, Tim J.; Vohra, Prerna; Stevens, Mark P.; Gally, David L.
Format: Online Article Text
Language: English
Published: Microbiology Society 2023
Subjects: Research Articles
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10634445/
https://www.ncbi.nlm.nih.gov/pubmed/37843883
http://dx.doi.org/10.1099/mgen.0.001116
author Chalka, Antonia
Dallman, Tim J.
Vohra, Prerna
Stevens, Mark P.
Gally, David L.
author_sort Chalka, Antonia
collection PubMed
description Salmonella enterica is a taxonomically diverse pathogen with over 2600 serovars associated with a wide variety of animal hosts including humans, other mammals, birds and reptiles. Some serovars are host-specific or host-restricted and cause disease in distinct host species, while others, such as serovar S. Typhimurium (STm), are generalists and have the potential to colonize a wide variety of species. However, even within generalist serovars such as STm it is becoming clear that pathovariants exist that differ in tropism and virulence. Identifying the genetic factors underlying host specificity is complex, but the availability of thousands of genome sequences and advances in machine learning have made it possible to build specific host prediction models to aid outbreak control and predict the human pathogenic potential of isolates from animals and other reservoirs. We have advanced this area by building host-association prediction models trained on a wide range of genomic features and compared them with predictions based on nearest-neighbour phylogeny. SNPs, protein variants (PVs), antimicrobial resistance (AMR) profiles and intergenic regions (IGRs) were extracted from 3883 high-quality STm assemblies collected from humans, swine, bovine and poultry in the USA, and used to construct Random Forest (RF) machine learning models. An additional 244 recent STm assemblies from farm animals were used as a test set for further validation. The models based on PVs and IGRs had the best performance in terms of predicting the host of origin of isolates and outperformed nearest-neighbour phylogenetic host prediction as well as models based on SNPs or AMR data. However, the models did not yield reliable predictions when tested with isolates that were phylogenetically distinct from the training set. The IGR and PV models were often able to differentiate human isolates in clusters where the majority of isolates were from a single animal source. 
Notably, IGRs were the feature with the best performance across multiple models which may be due to IGRs acting as both a representation of their flanking genes, equivalent to PVs, while also capturing genomic regulatory variation, such as altered promoter regions. The IGR and PV models predict that ~45 % of the human infections with STm in the USA originate from bovine, ~40 % from poultry and ~14.5 % from swine, although sequences of isolates from other sources were not used for training. In summary, the research demonstrates a significant gain in accuracy for models with IGRs and PVs as features compared to SNP-based and core genome phylogeny predictions when applied within the existing population structure. This article contains data hosted by Microreact.
format Online
Article
Text
id pubmed-10634445
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Microbiology Society
record_format MEDLINE/PubMed
spelling pubmed-10634445 2023-11-15 Microb Genom Research Articles Microbiology Society 2023-10-16 /pmc/articles/PMC10634445/ /pubmed/37843883 http://dx.doi.org/10.1099/mgen.0.001116 Text en © 2023 The Authors https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.
title The advantage of intergenic regions as genomic features for machine-learning-based host attribution of Salmonella Typhimurium from the USA
topic Research Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10634445/
https://www.ncbi.nlm.nih.gov/pubmed/37843883
http://dx.doi.org/10.1099/mgen.0.001116