
Combination of whole genome sequencing and supervised machine learning provides unambiguous identification of eae-positive Shiga toxin-producing Escherichia coli

INTRODUCTION: The objective of this study was to develop, using a genome-wide machine learning approach, an unambiguous model to predict the presence of highly pathogenic STEC in E. coli read assemblies derived from complex samples potentially containing multiple E. coli strains. Our approach has t...


Bibliographic Details
Main Authors: Vorimore, Fabien, Jaudou, Sandra, Tran, Mai-Lan, Richard, Hugues, Fach, Patrick, Delannoy, Sabine
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects: Microbiology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10213463/
https://www.ncbi.nlm.nih.gov/pubmed/37250024
http://dx.doi.org/10.3389/fmicb.2023.1118158
_version_ 1785047629812465664
author Vorimore, Fabien
Jaudou, Sandra
Tran, Mai-Lan
Richard, Hugues
Fach, Patrick
Delannoy, Sabine
collection PubMed
description INTRODUCTION: The objective of this study was to develop, using a genome-wide machine learning approach, an unambiguous model to predict the presence of highly pathogenic STEC in E. coli read assemblies derived from complex samples potentially containing multiple E. coli strains. Our approach took into account the high genomic plasticity of E. coli and used the stratification of STEC and the classification of E. coli pathogroups, based on serotype and virulence factors, to identify specific combinations of biomarkers for improved characterization of eae-positive STEC (also named EHEC, for enterohemorrhagic E. coli), which are associated with bloody diarrhea and hemolytic uremic syndrome (HUS) in humans. METHODS: A Machine Learning (ML) approach was applied to a large curated dataset composed of 1,493 E. coli genome sequences and 1,178 Coding Sequences (CDS). Feature selection was performed using eight classification algorithms, reducing the number of CDS to six. From this reduced dataset, the eight ML models were trained with hyper-parameter tuning and cross-validation steps. RESULTS AND DISCUSSION: Remarkably, using only these six genes, EHEC can be clearly identified from E. coli read assemblies obtained from in silico mixtures and from complex samples such as milk metagenomes. These combinations of discriminative biomarkers can be implemented as novel marker genes for the unambiguous characterization of EHEC in mixtures of different E. coli strains as well as in raw milk metagenomes.
format Online
Article
Text
id pubmed-10213463
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10213463 2023-05-27 Front Microbiol Microbiology Frontiers Media S.A. 2023-05-12 /pmc/articles/PMC10213463/ /pubmed/37250024 http://dx.doi.org/10.3389/fmicb.2023.1118158 Text en Copyright © 2023 Vorimore, Jaudou, Tran, Richard, Fach and Delannoy. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Combination of whole genome sequencing and supervised machine learning provides unambiguous identification of eae-positive Shiga toxin-producing Escherichia coli
topic Microbiology