
Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses


Bibliographic Details
Main authors: Manduchi, Elisabetta; Fu, Weixuan; Romano, Joseph D.; Ruberto, Stefano; Moore, Jason H.
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7528347/
https://www.ncbi.nlm.nih.gov/pubmed/32998684
http://dx.doi.org/10.1186/s12859-020-03755-4
author Manduchi, Elisabetta
Fu, Weixuan
Romano, Joseph D.
Ruberto, Stefano
Moore, Jason H.
collection PubMed
description BACKGROUND: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis. RESULTS: We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids ‘leakage’ during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj. CONCLUSIONS: In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.
format Online
Article
Text
id pubmed-7528347
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7528347 2020-10-02 Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses Manduchi, Elisabetta Fu, Weixuan Romano, Joseph D. Ruberto, Stefano Moore, Jason H. BMC Bioinformatics Methodology Article
BioMed Central 2020-10-01 /pmc/articles/PMC7528347/ /pubmed/32998684 http://dx.doi.org/10.1186/s12859-020-03755-4 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7528347/
https://www.ncbi.nlm.nih.gov/pubmed/32998684
http://dx.doi.org/10.1186/s12859-020-03755-4
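
The abstract describes regressing out covariates in a way that avoids 'leakage' during cross-validation. The sketch below illustrates that general idea with scikit-learn and is not the TPOT extension itself: the class name `ResidualizeFeatures`, the `covariate_idx` parameter, and the `demo` function are hypothetical names introduced here for illustration. It assumes a linear adjustment and handles only the features, whereas the article also covers adjusting the target. The key point is that the adjustment model sits inside the pipeline, so it is refitted on the training portion of each fold and only applied to the held-out portion.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline


class ResidualizeFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical helper (not part of TPOT): replace each feature by its
    residual after a linear regression on the covariate columns, then drop
    the covariate columns from the output."""

    def __init__(self, covariate_idx):
        self.covariate_idx = covariate_idx

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        cov = list(self.covariate_idx)
        self.feature_idx_ = [j for j in range(X.shape[1]) if j not in cov]
        # The adjustment model sees only the rows passed to fit(), i.e. the
        # training fold when this step runs inside cross-validation.
        self.model_ = LinearRegression().fit(X[:, cov], X[:, self.feature_idx_])
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        cov = list(self.covariate_idx)
        # Apply the already-fitted adjustment; nothing is refitted here.
        return X[:, self.feature_idx_] - self.model_.predict(X[:, cov])


def demo(seed=0):
    """Synthetic example: three features contaminated by one batch covariate."""
    rng = np.random.default_rng(seed)
    n = 200
    batch = rng.normal(size=(n, 1))           # assumed covariate (e.g. batch)
    signal = rng.normal(size=(n, 3))          # covariate-free part of features
    features = signal + 2.0 * batch           # features shifted by the covariate
    y = signal[:, 0] + 0.1 * rng.normal(size=n)
    X = np.hstack([features, batch])          # covariate stored as column 3

    # Residualization is a pipeline step, so each CV split refits it on the
    # training rows only before the downstream model is trained.
    pipe = make_pipeline(ResidualizeFeatures(covariate_idx=[3]), Ridge(alpha=1.0))
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(pipe, X, y, cv=cv, scoring="r2")
```

Calling `demo()` returns one R² score per fold. Fitting the adjustment once on the full data set before splitting would let held-out rows influence the residuals, which is the kind of leakage the abstract refers to.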