Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites

Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address...

Bibliographic Details
Main Authors:	Huckvale, Erik D., Powell, Christian D., Jin, Huan, Moseley, Hunter N. B.
Format:	Online Article Text
Language:	English
Published:	MDPI 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10673125/ https://www.ncbi.nlm.nih.gov/pubmed/37999216 http://dx.doi.org/10.3390/metabo13111120

_version_	1785140549121998848
author	Huckvale, Erik D. Powell, Christian D. Jin, Huan Moseley, Hunter N. B.
author_facet	Huckvale, Erik D. Powell, Christian D. Jin, Huan Moseley, Hunter N. B.
author_sort	Huckvale, Erik D.
collection	PubMed
description	Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from the KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.
format	Online Article Text
id	pubmed-10673125
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-10673125 2023-11-01 Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites Huckvale, Erik D. Powell, Christian D. Jin, Huan Moseley, Hunter N. B. Metabolites Article Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from the KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories. MDPI 2023-11-01 /pmc/articles/PMC10673125/ /pubmed/37999216 http://dx.doi.org/10.3390/metabo13111120 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Huckvale, Erik D. Powell, Christian D. Jin, Huan Moseley, Hunter N. B. Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites
title	Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites
title_full	Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites
title_fullStr	Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites
title_full_unstemmed	Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites
title_short	Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites
title_sort	benchmark dataset for training machine learning models to predict the pathway involvement of metabolites
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10673125/ https://www.ncbi.nlm.nih.gov/pubmed/37999216 http://dx.doi.org/10.3390/metabo13111120
work_keys_str_mv	AT huckvaleerikd benchmarkdatasetfortrainingmachinelearningmodelstopredictthepathwayinvolvementofmetabolites AT powellchristiand benchmarkdatasetfortrainingmachinelearningmodelstopredictthepathwayinvolvementofmetabolites AT jinhuan benchmarkdatasetfortrainingmachinelearningmodelstopredictthepathwayinvolvementofmetabolites AT moseleyhunternb benchmarkdatasetfortrainingmachinelearningmodelstopredictthepathwayinvolvementofmetabolites
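
Note on the metrics named in the description: the record reports an F1 score and a Matthews correlation coefficient (MCC) for binary classification models. As a minimal sketch of how these two metrics are conventionally computed from confusion-matrix counts, the following may help; the function names and the example counts are hypothetical and are not taken from the article or its source code.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from binary confusion counts; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts for one binary "in pathway category / not" classifier.
    tp, tn, fp, fn = 8, 8, 2, 2
    print("F1 :", f1_score(tp, fp, fn))
    print("MCC:", matthews_corrcoef(tp, tn, fp, fn))
```

Unlike F1, MCC also uses the true-negative count, which is one reason the two are often reported together for imbalanced binary labels.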
