Benchmark dataset for training machine learning models to predict the pathway involvement of metabolites

Metabolic pathways are a human-defined grouping of life-sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address...

Bibliographic Details
Main authors: Huckvale, Erik D., Powell, Christian D., Jin, Huan, Moseley, Hunter N.B.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10592640/
https://www.ncbi.nlm.nih.gov/pubmed/37873272
http://dx.doi.org/10.1101/2023.10.03.560715
author Huckvale, Erik D.
Powell, Christian D.
Jin, Huan
Moseley, Hunter N.B.
collection PubMed
description Metabolic pathways are a human-defined grouping of life-sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on older KEGG-based metabolite datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from KEGG, following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop Random Forest, XGBoost, and multilayer perceptron with autoencoder models and to compare their performance. The best overall weighted average performance across 1000 unique folds was an F1-score of 0.8180 and a Matthews correlation coefficient of 0.7933, achieved by XGBoost binary classification models for 11 KEGG-defined pathway categories.
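The two metrics named in the description, F1-score and Matthews correlation coefficient (MCC), can both be computed from the confusion counts of a single binary classifier. The following is a minimal sketch using the standard textbook definitions, not the authors' evaluation code; the function names and the toy labels are hypothetical, and the weighted averaging over pathway categories and folds mentioned in the description is not reproduced here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(tp, fp, fn, tn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal count is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy example (made-up labels): 1 = metabolite belongs to the pathway category.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("F1:", f1_score(tp, fp, fn))
print("MCC:", matthews_corrcoef(tp, fp, fn, tn))
```

Unlike F1, MCC also uses the true-negative count, which is one reason the two are often reported together for imbalanced binary tasks.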
format Online
Article
Text
id pubmed-10592640
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10592640 2023-10-24 Benchmark dataset for training machine learning models to predict the pathway involvement of metabolites Huckvale, Erik D. Powell, Christian D. Jin, Huan Moseley, Hunter N.B. bioRxiv Article
Cold Spring Harbor Laboratory 2023-10-09 /pmc/articles/PMC10592640/ /pubmed/37873272 http://dx.doi.org/10.1101/2023.10.03.560715 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title Benchmark dataset for training machine learning models to predict the pathway involvement of metabolites
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10592640/
https://www.ncbi.nlm.nih.gov/pubmed/37873272
http://dx.doi.org/10.1101/2023.10.03.560715