Model selection for metabolomics: predicting diagnosis of coronary artery disease using automated machine learning

Bibliographic Details
Main authors: Orlenko, Alena; Kofink, Daniel; Lyytikäinen, Leo-Pekka; Nikus, Kjell; Mishra, Pashupati; Kuukasjärvi, Pekka; Karhunen, Pekka J; Kähönen, Mika; Laurikka, Jari O; Lehtimäki, Terho; Asselbergs, Folkert W; Moore, Jason H
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7703753/
https://www.ncbi.nlm.nih.gov/pubmed/31702773
http://dx.doi.org/10.1093/bioinformatics/btz796
Description
Summary: MOTIVATION: Selecting the optimal machine learning (ML) model for a given dataset is often challenging. Automated ML (AutoML) has emerged as a powerful tool for enabling the automatic selection of ML methods and parameter settings for the prediction of biomedical endpoints. Here, we apply the tree-based pipeline optimization tool (TPOT) to predict angiographic diagnoses of coronary artery disease (CAD). With TPOT, ML models are represented as expression trees, and optimal pipelines are discovered using a stochastic search method called genetic programming. We provide some guidelines for TPOT-based ML pipeline selection and optimization based on various clinical phenotypes and high-throughput metabolic profiles in the Angiography and Genes Study (ANGES). RESULTS: We analyzed nuclear magnetic resonance-derived lipoprotein and metabolite profiles in the ANGES cohort with the goal of identifying the role of non-obstructive CAD patients in CAD diagnostics. We performed a comparative analysis of TPOT-generated ML pipelines with selected ML classifiers, optimized with a grid search approach, applied to two phenotypic CAD profiles. The TPOT-generated ML pipelines outperformed grid search optimized models across multiple performance metrics, including balanced accuracy and area under the precision-recall curve. With the selected models, we demonstrated that the phenotypic profile that distinguishes non-obstructive CAD patients from no CAD patients is associated with higher precision, suggesting a discrepancy in the underlying processes between these phenotypes. AVAILABILITY AND IMPLEMENTATION: TPOT is freely available via http://epistasislab.github.io/tpot/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
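
The summary compares models by balanced accuracy and area under the precision-recall curve. As a minimal sketch of how these two metrics can be computed for a binary classifier, the standard-library Python below may help. It is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are illustrative assumptions and do not come from the ANGES data. The area under the precision-recall curve is approximated here by average precision, and the sketch assumes that both classes are present and that no two scores are tied.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for binary labels (1 = case, 0 = control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: a step-wise estimate of the area under the precision-recall curve."""
    # Rank samples from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            total += tp / rank  # precision at the rank of each true positive
    return total / n_pos


# Hypothetical toy data, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

bal_acc = balanced_accuracy(y_true, y_pred)
avg_prec = average_precision(y_true, scores)
print(f"balanced accuracy: {bal_acc:.3f}")
print(f"average precision: {avg_prec:.3f}")
```

Balanced accuracy weights both classes equally, which matters when cases and controls are unevenly represented, while average precision summarizes how well the ranking of predicted scores concentrates true cases at the top.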