A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems

Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health outcomes usi...


Bibliographic Details
Main authors:	Topçuoğlu, Begüm D.; Lesniak, Nicholas A.; Ruffin, Mack T.; Wiens, Jenna; Schloss, Patrick D.
Format:	Online Article Text
Language:	English
Published:	American Society for Microbiology 2020
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7373189/ https://www.ncbi.nlm.nih.gov/pubmed/32518182 http://dx.doi.org/10.1128/mBio.00434-20

author	Topçuoğlu, Begüm D. Lesniak, Nicholas A. Ruffin, Mack T. Wiens, Jenna Schloss, Patrick D.
author_sort	Topçuoğlu, Begüm D.
collection	PubMed
description	Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, many researchers appear to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen-relevant neoplasias (SRNs) (n = 490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, a decision tree, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an area under the receiver operating characteristic curve (AUROC) of 0.695 (interquartile range [IQR], 0.651 to 0.739) but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 (IQR, 0.625 to 0.735), trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability.
format	Online Article Text
id	pubmed-7373189
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	American Society for Microbiology
record_format	MEDLINE/PubMed
spelling	pubmed-7373189 2020-07-24 mBio Research Article American Society for Microbiology 2020-06-09 /pmc/articles/PMC7373189/ /pubmed/32518182 http://dx.doi.org/10.1128/mBio.00434-20 Text en Copyright © 2020 Topçuoğlu et al. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0/).
title	A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems
topic	Research Article
