PfaSTer: a machine learning-powered serotype caller for Streptococcus pneumoniae genomes
Streptococcus pneumoniae (pneumococcus) is a leading cause of morbidity and mortality worldwide. Although multi-valent pneumococcal vaccines have curbed the incidence of disease, their introduction has resulted in shifted serotype distributions that must be monitored. Whole genome sequence (WGS) dat...
Main authors: | Lee, Jonathan T.; Li, Xingpeng; Hyde, Craig; Liberator, Paul A.; Hao, Li |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Microbiology Society, 2023 |
Subjects: | Short Communications |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10327508/ https://www.ncbi.nlm.nih.gov/pubmed/37279053 http://dx.doi.org/10.1099/mgen.0.001033 |
_version_ | 1785069642132226048 |
---|---|
author | Lee, Jonathan T.; Li, Xingpeng; Hyde, Craig; Liberator, Paul A.; Hao, Li |
author_facet | Lee, Jonathan T.; Li, Xingpeng; Hyde, Craig; Liberator, Paul A.; Hao, Li |
author_sort | Lee, Jonathan T. |
collection | PubMed |
description | Streptococcus pneumoniae (pneumococcus) is a leading cause of morbidity and mortality worldwide. Although multi-valent pneumococcal vaccines have curbed the incidence of disease, their introduction has resulted in shifted serotype distributions that must be monitored. Whole genome sequence (WGS) data provide a powerful surveillance tool for tracking isolate serotypes, which can be determined from nucleotide sequence of the capsular polysaccharide biosynthetic operon (cps). Although software exists to predict serotypes from WGS data, most are constrained by requiring high-coverage next-generation sequencing reads. This can present a challenge in respect of accessibility and data sharing. Here we present PfaSTer, a machine learning-based method to identify 65 prevalent serotypes from assembled S. pneumoniae genome sequences. PfaSTer combines dimensionality reduction from k-mer analysis with a Random Forest classifier for rapid serotype prediction. By leveraging the model’s built-in statistical framework, PfaSTer determines confidence in its predictions without the need for coverage-based assessments. We then demonstrate the robustness of this method, returning >97 % concordance when compared to biochemical results and other in silico serotyping tools. PfaSTer is open source and available at: https://github.com/pfizer-opensource/pfaster. |
format | Online Article Text |
id | pubmed-10327508 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Microbiology Society |
record_format | MEDLINE/PubMed |
spelling | pubmed-10327508 2023-07-08 PfaSTer: a machine learning-powered serotype caller for Streptococcus pneumoniae genomes Lee, Jonathan T. Li, Xingpeng Hyde, Craig Liberator, Paul A. Hao, Li Microb Genom Short Communications Streptococcus pneumoniae (pneumococcus) is a leading cause of morbidity and mortality worldwide. Although multi-valent pneumococcal vaccines have curbed the incidence of disease, their introduction has resulted in shifted serotype distributions that must be monitored. Whole genome sequence (WGS) data provide a powerful surveillance tool for tracking isolate serotypes, which can be determined from nucleotide sequence of the capsular polysaccharide biosynthetic operon (cps). Although software exists to predict serotypes from WGS data, most are constrained by requiring high-coverage next-generation sequencing reads. This can present a challenge in respect of accessibility and data sharing. Here we present PfaSTer, a machine learning-based method to identify 65 prevalent serotypes from assembled S. pneumoniae genome sequences. PfaSTer combines dimensionality reduction from k-mer analysis with a Random Forest classifier for rapid serotype prediction. By leveraging the model’s built-in statistical framework, PfaSTer determines confidence in its predictions without the need for coverage-based assessments. We then demonstrate the robustness of this method, returning >97 % concordance when compared to biochemical results and other in silico serotyping tools. PfaSTer is open source and available at: https://github.com/pfizer-opensource/pfaster. Microbiology Society 2023-06-06 /pmc/articles/PMC10327508/ /pubmed/37279053 http://dx.doi.org/10.1099/mgen.0.001033 Text en © 2023 The Authors https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License. |
spellingShingle | Short Communications Lee, Jonathan T. Li, Xingpeng Hyde, Craig Liberator, Paul A. Hao, Li PfaSTer: a machine learning-powered serotype caller for Streptococcus pneumoniae genomes |
title | PfaSTer: a machine learning-powered serotype caller for Streptococcus pneumoniae genomes |
title_full | PfaSTer: a machine learning-powered serotype caller for Streptococcus pneumoniae genomes |
title_fullStr | PfaSTer: a machine learning-powered serotype caller for Streptococcus pneumoniae genomes |
title_full_unstemmed | PfaSTer: a machine learning-powered serotype caller for Streptococcus pneumoniae genomes |
title_short | PfaSTer: a machine learning-powered serotype caller for Streptococcus pneumoniae genomes |
title_sort | pfaster: a machine learning-powered serotype caller for streptococcus pneumoniae genomes |
topic | Short Communications |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10327508/ https://www.ncbi.nlm.nih.gov/pubmed/37279053 http://dx.doi.org/10.1099/mgen.0.001033 |
work_keys_str_mv | AT leejonathant pfasteramachinelearningpoweredserotypecallerforstreptococcuspneumoniaegenomes AT lixingpeng pfasteramachinelearningpoweredserotypecallerforstreptococcuspneumoniaegenomes AT hydecraig pfasteramachinelearningpoweredserotypecallerforstreptococcuspneumoniaegenomes AT liberatorpaula pfasteramachinelearningpoweredserotypecallerforstreptococcuspneumoniaegenomes AT haoli pfasteramachinelearningpoweredserotypecallerforstreptococcuspneumoniaegenomes |
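
The description above mentions k-mer analysis feeding a Random Forest classifier, with prediction confidence taken from the model itself. The following is only a minimal, standard-library sketch of those two ideas and is not PfaSTer's actual code: the function names `kmer_profile` and `vote_confidence`, the choice of `k`, and the use of a simple vote fraction as a confidence score are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=3):
    """Return normalized k-mer frequencies in a fixed A/C/G/T ordering.

    Hypothetical helper: k-mers containing other characters (e.g. N)
    are ignored. The real tool's feature construction may differ.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in all_kmers)
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[km] / total for km in all_kmers]


def vote_confidence(votes):
    """Return the majority label and the fraction of votes it received.

    Assumption: each entry in `votes` is one tree's predicted serotype.
    This vote fraction is a stand-in for a forest-derived confidence.
    """
    tally = Counter(votes)
    label, n = tally.most_common(1)[0]
    return label, n / len(votes)


# Usage: a 2-mer profile has 4**2 = 16 entries that sum to 1.
profile = kmer_profile("ACGT", k=2)
label, confidence = vote_confidence(["19A", "19A", "19F", "19A"])
```

In practice a fixed-length profile like this would be the input vector to a trained classifier, and a low vote fraction could flag a call for manual review.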