
Predicting variable gene content in Escherichia coli using conserved genes

Having the ability to predict the protein-encoding gene content of an incomplete genome or metagenome-assembled genome is important for a variety of bioinformatic tasks. In this study, as a proof of concept, we built machine learning classifiers for predicting variable gene content in Escherichia co...


Bibliographic Details
Main authors:	Nguyen, Marcus; Elmore, Zachary; Ihle, Clay; Moen, Francesco S.; Slater, Adam D.; Turner, Benjamin N.; Parrello, Bruce; Best, Aaron A.; Davis, James J.
Format:	Online Article Text
Language:	English
Published:	American Society for Microbiology 2023
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10469788/ https://www.ncbi.nlm.nih.gov/pubmed/37314210 http://dx.doi.org/10.1128/msystems.00058-23

_version_	1785099522870870016
author	Nguyen, Marcus; Elmore, Zachary; Ihle, Clay; Moen, Francesco S.; Slater, Adam D.; Turner, Benjamin N.; Parrello, Bruce; Best, Aaron A.; Davis, James J.
author_sort	Nguyen, Marcus
collection	PubMed
description	Having the ability to predict the protein-encoding gene content of an incomplete genome or metagenome-assembled genome is important for a variety of bioinformatic tasks. In this study, as a proof of concept, we built machine learning classifiers for predicting variable gene content in Escherichia coli genomes using only the nucleotide k-mers from a set of 100 conserved genes as features. Protein families were used to define orthologs, and a single classifier was built for predicting the presence or absence of each protein family occurring in 10%–90% of all E. coli genomes. The resulting set of 3,259 extreme gradient boosting classifiers had a per-genome average macro F1 score of 0.944 [0.943–0.945, 95% CI]. We show that the F1 scores are stable across multi-locus sequence types and that the trend can be recapitulated by sampling a smaller number of core genes or diverse input genomes. Surprisingly, the presence or absence of poorly annotated proteins, including “hypothetical proteins,” was accurately predicted (F1 = 0.902 [0.898–0.906, 95% CI]). Models for proteins with horizontal gene transfer-related functions had slightly lower F1 scores but were still accurate (F1s = 0.895, 0.872, 0.824, and 0.841 for transposon, phage, plasmid, and antimicrobial resistance-related functions, respectively). Finally, using a holdout set of 419 diverse E. coli genomes that were isolated from freshwater environmental sources, we observed an average per-genome F1 score of 0.880 [0.876–0.883, 95% CI], demonstrating the extensibility of the models. Overall, this study provides a framework for predicting variable gene content using a limited amount of input sequence data.
IMPORTANCE: Having the ability to predict the protein-encoding gene content of a genome is important for assessing genome quality, binning genomes from shotgun metagenomic assemblies, and assessing risk due to the presence of antimicrobial resistance and other virulence genes. In this study, we built a set of binary classifiers for predicting the presence or absence of variable genes occurring in 10%–90% of all publicly available E. coli genomes. Overall, the results show that a large portion of the E. coli variable gene content can be predicted with high accuracy, including genes with functions relating to horizontal gene transfer. This study offers a strategy for predicting gene content using limited input sequence data.
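The description above names three building blocks: nucleotide k-mer features, a 10%–90% prevalence filter on protein families, and a macro F1 score. The following is a minimal Python sketch of those pieces only, not the authors' pipeline. All function names, the handling of non-ACGT characters, and the reading of macro F1 as the unweighted mean of the "present" and "absent" class F1 scores are assumptions made for illustration; the gradient boosting classifiers themselves are not reproduced here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping nucleotide k-mers in one sequence.

    Assumption: k-mers containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def variable_families(presence, low=0.10, high=0.90):
    """Return the families present in a fraction of genomes within [low, high].

    `presence` maps a (hypothetical) family name to a list of 0/1 calls,
    one call per genome.
    """
    keep = []
    for family, calls in presence.items():
        fraction = sum(calls) / len(calls)
        if low <= fraction <= high:
            keep.append(family)
    return sorted(keep)


def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores for presence (1) and absence (0)."""
    return (f1_for_class(y_true, y_pred, 1) + f1_for_class(y_true, y_pred, 0)) / 2
```

In a setup like the one described, the k-mer counts from the conserved genes of each genome would form one feature row, and each family returned by the prevalence filter would get its own binary classifier.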
format	Online Article Text
id	pubmed-10469788
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	American Society for Microbiology
record_format	MEDLINE/PubMed
spelling	pubmed-10469788 2023-09-01 mSystems Research Article American Society for Microbiology 2023-06-14 /pmc/articles/PMC10469788/ /pubmed/37314210 http://dx.doi.org/10.1128/msystems.00058-23 Text en https://doi.org/10.1128/AuthorWarrantyLicense.v1 This is a work of the U.S. Government and is not subject to copyright protection in the United States. Foreign copyrights may apply.
title	Predicting variable gene content in Escherichia coli using conserved genes
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10469788/ https://www.ncbi.nlm.nih.gov/pubmed/37314210 http://dx.doi.org/10.1128/msystems.00058-23
