Combination of whole genome sequencing and supervised machine learning provides unambiguous identification of eae-positive Shiga toxin-producing Escherichia coli
INTRODUCTION: The objective of this study was to develop, using a genome-wide machine learning approach, an unambiguous model to predict the presence of highly pathogenic STEC in E. coli read assemblies derived from complex samples potentially containing multiple E. coli strains. Our approach has t...
Main authors: | Vorimore, Fabien; Jaudou, Sandra; Tran, Mai-Lan; Richard, Hugues; Fach, Patrick; Delannoy, Sabine |
---|---|
Format: | Online Article Text |
Language: | English |
Published: |
Frontiers Media S.A.
2023
|
Subjects: | Microbiology |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10213463/ https://www.ncbi.nlm.nih.gov/pubmed/37250024 http://dx.doi.org/10.3389/fmicb.2023.1118158 |
_version_ | 1785047629812465664 |
---|---|
author | Vorimore, Fabien; Jaudou, Sandra; Tran, Mai-Lan; Richard, Hugues; Fach, Patrick; Delannoy, Sabine |
author_facet | Vorimore, Fabien; Jaudou, Sandra; Tran, Mai-Lan; Richard, Hugues; Fach, Patrick; Delannoy, Sabine |
author_sort | Vorimore, Fabien |
collection | PubMed |
description | INTRODUCTION: The objective of this study was to develop, using a genome-wide machine learning approach, an unambiguous model to predict the presence of highly pathogenic STEC in E. coli read assemblies derived from complex samples potentially containing multiple E. coli strains. Our approach took into account the high genomic plasticity of E. coli and used the stratification of STEC and the classification of E. coli pathogroups based on serotype and virulence factors to identify specific combinations of biomarkers for improved characterization of eae-positive STEC (also named EHEC, for enterohemorrhagic E. coli), which are associated with bloody diarrhea and hemolytic uremic syndrome (HUS) in humans. METHODS: A machine learning (ML) approach was applied to a large curated dataset composed of 1,493 E. coli genome sequences and 1,178 coding sequences (CDS). Feature selection was performed using eight classification algorithms, reducing the number of CDS to six. From this reduced dataset, the eight ML models were trained with hyper-parameter tuning and cross-validation steps. RESULTS AND DISCUSSION: Remarkably, using only these six genes, EHEC can be clearly identified from E. coli read assemblies obtained from in silico mixtures and complex samples such as milk metagenomes. These various combinations of discriminative biomarkers can be implemented as novel marker genes for the unambiguous characterization of EHEC in mixtures of different E. coli strains as well as in raw milk metagenomes. |
format | Online Article Text |
id | pubmed-10213463 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-10213463 2023-05-27 Combination of whole genome sequencing and supervised machine learning provides unambiguous identification of eae-positive Shiga toxin-producing Escherichia coli Vorimore, Fabien Jaudou, Sandra Tran, Mai-Lan Richard, Hugues Fach, Patrick Delannoy, Sabine Front Microbiol Microbiology INTRODUCTION: The objective of this study was to develop, using a genome-wide machine learning approach, an unambiguous model to predict the presence of highly pathogenic STEC in E. coli read assemblies derived from complex samples potentially containing multiple E. coli strains. Our approach took into account the high genomic plasticity of E. coli and used the stratification of STEC and the classification of E. coli pathogroups based on serotype and virulence factors to identify specific combinations of biomarkers for improved characterization of eae-positive STEC (also named EHEC, for enterohemorrhagic E. coli), which are associated with bloody diarrhea and hemolytic uremic syndrome (HUS) in humans. METHODS: A machine learning (ML) approach was applied to a large curated dataset composed of 1,493 E. coli genome sequences and 1,178 coding sequences (CDS). Feature selection was performed using eight classification algorithms, reducing the number of CDS to six. From this reduced dataset, the eight ML models were trained with hyper-parameter tuning and cross-validation steps. RESULTS AND DISCUSSION: Remarkably, using only these six genes, EHEC can be clearly identified from E. coli read assemblies obtained from in silico mixtures and complex samples such as milk metagenomes. These various combinations of discriminative biomarkers can be implemented as novel marker genes for the unambiguous characterization of EHEC in mixtures of different E. coli strains as well as in raw milk metagenomes. Frontiers Media S.A.
2023-05-12 /pmc/articles/PMC10213463/ /pubmed/37250024 http://dx.doi.org/10.3389/fmicb.2023.1118158 Text en Copyright © 2023 Vorimore, Jaudou, Tran, Richard, Fach and Delannoy. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Microbiology Vorimore, Fabien Jaudou, Sandra Tran, Mai-Lan Richard, Hugues Fach, Patrick Delannoy, Sabine Combination of whole genome sequencing and supervised machine learning provides unambiguous identification of eae-positive Shiga toxin-producing Escherichia coli |
title | Combination of whole genome sequencing and supervised machine learning provides unambiguous identification of eae-positive Shiga toxin-producing Escherichia coli |
topic | Microbiology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10213463/ https://www.ncbi.nlm.nih.gov/pubmed/37250024 http://dx.doi.org/10.3389/fmicb.2023.1118158 |
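
The workflow summarized in the description field (reduce a large set of CDS to a handful of markers, then tune and cross-validate a classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic presence/absence vectors, the enrichment score and the count-threshold classifier are simple stand-ins, and none of it reproduces the eight classification algorithms or the six genes reported in the article.

```python
import random

def make_toy_data(n_genomes=200, n_cds=50, n_informative=6, seed=0):
    """Synthetic CDS presence/absence matrix. Hypothetical data, NOT the study's."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_genomes):
        label = rng.random() < 0.5              # 1 = toy "EHEC", 0 = other E. coli
        row = []
        for j in range(n_cds):
            if j < n_informative:
                p = 0.9 if label else 0.1       # marker-like CDS
            else:
                p = 0.5                         # uninformative CDS
            row.append(1 if rng.random() < p else 0)
        X.append(row)
        y.append(int(label))
    return X, y

def select_features(X, y, k=6):
    """Keep the k CDS most enriched in the positive class (assumed scoring rule)."""
    pos = [r for r, t in zip(X, y) if t == 1]
    neg = [r for r, t in zip(X, y) if t == 0]
    def enrichment(j):
        return sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
    return sorted(range(len(X[0])), key=enrichment, reverse=True)[:k]

def predict(row, features, threshold):
    """Call a genome positive when at least `threshold` selected CDS are present."""
    return int(sum(row[j] for j in features) >= threshold)

def accuracy(X, y, features, threshold):
    return sum(predict(r, features, threshold) == t for r, t in zip(X, y)) / len(y)

def cross_validate(X, y, threshold, k_features=6, n_folds=5):
    """k-fold CV; features are re-selected inside each fold to avoid leakage."""
    scores = []
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        X_train = [r for i, r in enumerate(X) if i not in test_idx]
        y_train = [t for i, t in enumerate(y) if i not in test_idx]
        X_test = [r for i, r in enumerate(X) if i in test_idx]
        y_test = [t for i, t in enumerate(y) if i in test_idx]
        feats = select_features(X_train, y_train, k_features)
        scores.append(accuracy(X_test, y_test, feats, threshold))
    return sum(scores) / len(scores)

X, y = make_toy_data()
# Hyper-parameter tuning: try every threshold and keep the best mean CV accuracy.
cv_scores = {t: cross_validate(X, y, t) for t in range(1, 7)}
best_threshold = max(cv_scores, key=cv_scores.get)
selected = select_features(X, y)
print("selected CDS indices:", sorted(selected))
print("best threshold:", best_threshold, "mean CV accuracy:", cv_scores[best_threshold])
```

Re-selecting features inside each fold keeps the held-out genomes from influencing which markers are chosen, which is one common way to keep a cross-validated estimate honest; whether the authors structured their pipeline this way is not stated in this record.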