Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses

BACKGROUND: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach...

Bibliographic Details
Main authors:	Manduchi, Elisabetta; Fu, Weixuan; Romano, Joseph D.; Ruberto, Stefano; Moore, Jason H.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2020
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7528347/ https://www.ncbi.nlm.nih.gov/pubmed/32998684 http://dx.doi.org/10.1186/s12859-020-03755-4

_version_	1783589244854337536
author	Manduchi, Elisabetta; Fu, Weixuan; Romano, Joseph D.; Ruberto, Stefano; Moore, Jason H.
author_sort	Manduchi, Elisabetta
collection	PubMed
description	BACKGROUND: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis. RESULTS: We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids ‘leakage’ during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj. CONCLUSIONS: In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.
format	Online Article Text
id	pubmed-7528347
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7528347 2020-10-02 BMC Bioinformatics Methodology Article
BioMed Central 2020-10-01 /pmc/articles/PMC7528347/ /pubmed/32998684 http://dx.doi.org/10.1186/s12859-020-03755-4 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7528347/ https://www.ncbi.nlm.nih.gov/pubmed/32998684 http://dx.doi.org/10.1186/s12859-020-03755-4
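
The description above says the approach regresses out covariates in a way that avoids 'leakage' during cross-validation. The sketch below illustrates that general idea with plain scikit-learn; it is not the TPOT extension's API (see the GitHub link in the description for the actual code). The function names `residualize` and `cv_scores_with_adjustment`, the linear adjustment model, the continuous target, the random forest, and the synthetic data are all assumptions made for illustration. The key point is that the adjustment regression is fitted on the training fold only and then applied unchanged to the held-out fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def residualize(train_cov, train_vals, test_cov, test_vals):
    """Fit values ~ covariates on the training fold only, then return
    residuals for both folds using that single fitted model."""
    adjuster = LinearRegression().fit(train_cov, train_vals)
    return (train_vals - adjuster.predict(train_cov),
            test_vals - adjuster.predict(test_cov))


def cv_scores_with_adjustment(X, y, C, n_splits=5, seed=0):
    """Cross-validated R^2 with features and target adjusted per fold."""
    scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        # Adjust the features, then the target, without touching
        # held-out rows when estimating the adjustment coefficients.
        X_tr, X_te = residualize(C[train_idx], X[train_idx],
                                 C[test_idx], X[test_idx])
        y_tr, y_te = residualize(C[train_idx], y[train_idx],
                                 C[test_idx], y[test_idx])
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return np.array(scores)


# Synthetic example (hypothetical data): two covariates shift both the
# features and the target, on top of a genuine feature-target signal.
rng = np.random.default_rng(0)
n = 200
C = rng.normal(size=(n, 2))
signal = rng.normal(size=(n, 3))
X = signal + C @ rng.normal(size=(2, 3))
y = signal[:, 0] + 2.0 * C[:, 0] + 0.1 * rng.normal(size=n)

scores = cv_scores_with_adjustment(X, y, C)
print(scores.mean())
```

Fitting the adjustment on the full data set before splitting would let held-out samples influence the residuals seen during training, which is the leakage the description refers to. A categorical target would need a different adjustment model than the linear regression assumed here.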
