
PfaSTer: a machine learning-powered serotype caller for Streptococcus pneumoniae genomes

Streptococcus pneumoniae (pneumococcus) is a leading cause of morbidity and mortality worldwide. Although multi-valent pneumococcal vaccines have curbed the incidence of disease, their introduction has resulted in shifted serotype distributions that must be monitored. Whole genome sequence (WGS) dat...

Full description

Bibliographic Details
Main authors: Lee, Jonathan T., Li, Xingpeng, Hyde, Craig, Liberator, Paul A., Hao, Li
Format: Online Article Text
Language: English
Published: Microbiology Society 2023
Subjects: Short Communications
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10327508/
https://www.ncbi.nlm.nih.gov/pubmed/37279053
http://dx.doi.org/10.1099/mgen.0.001033
author Lee, Jonathan T.
Li, Xingpeng
Hyde, Craig
Liberator, Paul A.
Hao, Li
collection PubMed
description Streptococcus pneumoniae (pneumococcus) is a leading cause of morbidity and mortality worldwide. Although multi-valent pneumococcal vaccines have curbed the incidence of disease, their introduction has resulted in shifted serotype distributions that must be monitored. Whole genome sequence (WGS) data provide a powerful surveillance tool for tracking isolate serotypes, which can be determined from nucleotide sequence of the capsular polysaccharide biosynthetic operon (cps). Although software exists to predict serotypes from WGS data, most are constrained by requiring high-coverage next-generation sequencing reads. This can present a challenge in respect of accessibility and data sharing. Here we present PfaSTer, a machine learning-based method to identify 65 prevalent serotypes from assembled S. pneumoniae genome sequences. PfaSTer combines dimensionality reduction from k-mer analysis with a Random Forest classifier for rapid serotype prediction. By leveraging the model’s built-in statistical framework, PfaSTer determines confidence in its predictions without the need for coverage-based assessments. We then demonstrate the robustness of this method, returning >97 % concordance when compared to biochemical results and other in silico serotyping tools. PfaSTer is open source and available at: https://github.com/pfizer-opensource/pfaster.
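The approach summarised in the description above (k-mer features feeding a Random Forest whose own statistics supply a confidence value) can be illustrated with a minimal, standard-library-only sketch. It is not PfaSTer's implementation: the k-mer length `K`, the use of canonical k-mers, the function names `kmer_profile` and `call_from_votes`, and the `min_confidence` threshold are all assumptions made for illustration, and the per-tree votes are supplied by hand rather than produced by a trained model. The fraction of trees agreeing on a label is one common way a forest can report confidence without any read-coverage information; the record does not say how PfaSTer itself computes it.

```python
from collections import Counter

K = 5  # hypothetical k-mer length; the record does not state the value PfaSTer uses

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_profile(sequence: str, k: int = K) -> Counter:
    """Count canonical k-mers in an assembled sequence.

    Windows containing anything other than A, C, G or T (for example N) are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            counts[canonical(window)] += 1
    return counts


def call_from_votes(tree_votes, min_confidence=0.8):
    """Summarise per-tree votes as (majority label, vote fraction, passes threshold).

    The threshold is an illustrative assumption, not a value taken from the source.
    """
    if not tree_votes:
        raise ValueError("tree_votes must not be empty")
    label, n_votes = Counter(tree_votes).most_common(1)[0]
    confidence = n_votes / len(tree_votes)
    return label, confidence, confidence >= min_confidence


# Toy usage: a k-mer profile of a short sequence, and a call from ten made-up tree votes.
profile = kmer_profile("ACNGT", k=2)
label, confidence, accepted = call_from_votes(["19A"] * 9 + ["19F"])
```

In a real pipeline the k-mer counts would be arranged into a fixed-length feature vector and passed to a trained classifier; here the two steps are kept separate so that each can be read on its own.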
format Online
Article
Text
id pubmed-10327508
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Microbiology Society
record_format MEDLINE/PubMed
spelling pubmed-10327508 2023-07-08 Microb Genom Short Communications Microbiology Society 2023-06-06 /pmc/articles/PMC10327508/ /pubmed/37279053 http://dx.doi.org/10.1099/mgen.0.001033 Text en © 2023 The Authors https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License.
topic Short Communications