
Harmonization of supervised machine learning practices for efficient source attribution of Listeria monocytogenes based on genomic data

BACKGROUND: Genomic data-based machine learning tools are promising for real-time surveillance activities performing source attribution of foodborne bacteria such as Listeria monocytogenes. Given the heterogeneity of machine learning practices, our aim was to identify those influencing the source pr...

Bibliographic Details
Main authors: Castelli, Pierluigi, De Ruvo, Andrea, Bucciacchio, Andrea, D’Alterio, Nicola, Cammà, Cesare, Di Pasquale, Adriano, Radomski, Nicolas
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10515079/
https://www.ncbi.nlm.nih.gov/pubmed/37736708
http://dx.doi.org/10.1186/s12864-023-09667-w
_version_ 1785108870482362368
author Castelli, Pierluigi
De Ruvo, Andrea
Bucciacchio, Andrea
D’Alterio, Nicola
Cammà, Cesare
Di Pasquale, Adriano
Radomski, Nicolas
author_sort Castelli, Pierluigi
collection PubMed
description BACKGROUND: Genomic data-based machine learning tools are promising for real-time surveillance activities performing source attribution of foodborne bacteria such as Listeria monocytogenes. Given the heterogeneity of machine learning practices, our aim was to identify those influencing the source prediction performance of the usual holdout method combined with the repeated k-fold cross-validation method. METHODS: A large collection of 1 100 L. monocytogenes genomes with known sources was built according to several genomic metrics to ensure authenticity and completeness of genomic profiles. Based on these genomic profiles (i.e. 7-locus alleles, core alleles, accessory genes, core SNPs and pan kmers), we developed a versatile workflow assessing the prediction performance of different combinations of training dataset splitting (i.e. 50, 60, 70, 80 and 90%), data preprocessing (i.e. with or without near-zero variance removal), and learning models (i.e. BLR, ERT, RF, SGB, SVM and XGB). The performance metrics included accuracy, Cohen’s kappa, F1-score, area under the receiver operating characteristic, precision-recall and precision-recall-gain curves, and execution time. RESULTS: The average testing accuracies from accessory genes and pan kmers were significantly higher than those from core alleles or SNPs. While the accuracies from 70 and 80% of training dataset splitting were not significantly different, those from 80% were significantly higher than those from the other tested proportions. Near-zero variance removal prevented results from being produced for 7-locus alleles, did not significantly affect the accuracy for core alleles, accessory genes and pan kmers, and significantly decreased the accuracy for core SNPs. The SVM and XGB models did not differ significantly from each other in accuracy and reached significantly higher accuracies than BLR, SGB, ERT and RF, in that order. However, the SVM model required more computing power than the XGB model, especially for large numbers of descriptors such as core SNPs and pan kmers. CONCLUSIONS: In addition to recommendations about machine learning practices for L. monocytogenes source attribution based on genomic data, the present study also provides a freely available workflow to solve other balanced or unbalanced multiclass phenotypes from binary and categorical genomic profiles of other microorganisms without source code modifications. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-023-09667-w.
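The data-handling steps named in the abstract (near-zero variance removal, a holdout split, and repeated k-fold cross-validation) can be illustrated with a minimal standard-library Python sketch. It is not the authors' workflow: the function names, the toy source labels, the stratified split, the thresholds (a frequency ratio above 95/5 combined with fewer than 10% distinct values), and the choice to run the repeated k-fold on the training portion only are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def near_zero_variance(column, freq_cut=95 / 5, unique_cut=10.0):
    """Return True if a feature column looks (near-)constant.

    Assumed heuristic: flag the column when the most common value outnumbers
    the second most common by more than freq_cut AND the percentage of
    distinct values is below unique_cut.
    """
    counts = Counter(column).most_common()
    if len(counts) == 1:  # zero variance: a single distinct value
        return True
    freq_ratio = counts[0][1] / counts[1][1]
    percent_unique = 100.0 * len(counts) / len(column)
    return freq_ratio > freq_cut and percent_unique < unique_cut


def stratified_holdout(labels, train_fraction, rng):
    """Split sample indices into train/test, keeping class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


def repeated_kfold(indices, k, repeats, rng):
    """Yield (fit, validate) index lists for each fold of each repetition."""
    for _ in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i, validate in enumerate(folds):
            fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield fit, validate


# Toy, hypothetical source labels for 100 genomes (not the study's data).
rng = random.Random(0)
labels = ["food"] * 50 + ["animal"] * 30 + ["environment"] * 20
train, test = stratified_holdout(labels, 0.8, rng)        # 80% training split
splits = list(repeated_kfold(train, k=5, repeats=3, rng=rng))
print(len(train), len(test), len(splits))  # prints: 80 20 15
```

In this sketch the held-out 20% is never touched by the cross-validation loop, so it remains available for a final testing accuracy, while the 15 fit/validate pairs give repeated estimates on the training portion.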
format Online
Article
Text
id pubmed-10515079
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10515079 2023-09-23 Harmonization of supervised machine learning practices for efficient source attribution of Listeria monocytogenes based on genomic data Castelli, Pierluigi De Ruvo, Andrea Bucciacchio, Andrea D’Alterio, Nicola Cammà, Cesare Di Pasquale, Adriano Radomski, Nicolas BMC Genomics Research BioMed Central 2023-09-22 /pmc/articles/PMC10515079/ /pubmed/37736708 http://dx.doi.org/10.1186/s12864-023-09667-w Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Harmonization of supervised machine learning practices for efficient source attribution of Listeria monocytogenes based on genomic data
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10515079/
https://www.ncbi.nlm.nih.gov/pubmed/37736708
http://dx.doi.org/10.1186/s12864-023-09667-w