Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses
BACKGROUND: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach...
Main Authors: | Manduchi, Elisabetta; Fu, Weixuan; Romano, Joseph D.; Ruberto, Stefano; Moore, Jason H. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7528347/ https://www.ncbi.nlm.nih.gov/pubmed/32998684 http://dx.doi.org/10.1186/s12859-020-03755-4 |
_version_ | 1783589244854337536 |
---|---|
author | Manduchi, Elisabetta; Fu, Weixuan; Romano, Joseph D.; Ruberto, Stefano; Moore, Jason H. |
author_facet | Manduchi, Elisabetta; Fu, Weixuan; Romano, Joseph D.; Ruberto, Stefano; Moore, Jason H. |
author_sort | Manduchi, Elisabetta |
collection | PubMed |
description | BACKGROUND: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis. RESULTS: We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids ‘leakage’ during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj. CONCLUSIONS: In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field. |
format | Online Article Text |
id | pubmed-7528347 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7528347 2020-10-02 Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses Manduchi, Elisabetta; Fu, Weixuan; Romano, Joseph D.; Ruberto, Stefano; Moore, Jason H. BMC Bioinformatics Methodology Article
BioMed Central 2020-10-01 /pmc/articles/PMC7528347/ /pubmed/32998684 http://dx.doi.org/10.1186/s12859-020-03755-4 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Methodology Article Manduchi, Elisabetta Fu, Weixuan Romano, Joseph D. Ruberto, Stefano Moore, Jason H. Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses |
title | Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses |
title_sort | embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7528347/ https://www.ncbi.nlm.nih.gov/pubmed/32998684 http://dx.doi.org/10.1186/s12859-020-03755-4 |
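
The description above mentions regressing out covariates in a way that avoids 'leakage' during cross-validation. A minimal sketch of that general idea follows, assuming ordinary least squares as the adjustment model and plain NumPy instead of TPOT. The names `CovariateResidualizer` and `residualize_per_fold` are hypothetical helpers written for illustration and are not the API of the TPOT extension; the data are synthetic. The adjustment coefficients are estimated on the training rows of each fold only and then applied unchanged to the held-out rows.

```python
import numpy as np


class CovariateResidualizer:
    """Regress covariates out of features with ordinary least squares.

    Hypothetical helper for illustration; not part of TPOT.
    """

    def fit(self, features, covariates):
        # Design matrix with an intercept column, estimated on training rows only.
        design = np.column_stack([np.ones(len(covariates)), covariates])
        self.coef_, *_ = np.linalg.lstsq(design, features, rcond=None)
        return self

    def transform(self, features, covariates):
        # Apply the stored training coefficients; nothing is re-estimated here.
        design = np.column_stack([np.ones(len(covariates)), covariates])
        return features - design @ self.coef_


def residualize_per_fold(features, covariates, n_splits=5, seed=0):
    """Yield (train_idx, test_idx, train_resid, test_resid) for each fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(features)), n_splits)
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        adj = CovariateResidualizer().fit(features[train_idx], covariates[train_idx])
        yield (
            train_idx,
            test_idx,
            adj.transform(features[train_idx], covariates[train_idx]),
            adj.transform(features[test_idx], covariates[test_idx]),
        )


# Synthetic data (assumption): one batch-like covariate shifts all three features.
rng = np.random.default_rng(42)
covariates = rng.normal(size=(200, 1))
signal = rng.normal(size=(200, 3))
features = signal + 2.0 * covariates

fold_results = list(residualize_per_fold(features, covariates))
```

Fitting the adjustment on the full data set before splitting would let held-out rows influence the coefficients, which is the kind of leakage the description refers to. A downstream model would be trained on `train_resid` and scored on `test_resid` within each fold.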