Scaling tree-based automated machine learning to biomedical big data with a feature set selector

MOTIVATION: Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, Tree-based Pipeline Optimization Tool (TPOT) was developed using strong...

Bibliographic Details
Main Authors: Le, Trang T, Fu, Weixuan, Moore, Jason H
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6956793/
https://www.ncbi.nlm.nih.gov/pubmed/31165141
http://dx.doi.org/10.1093/bioinformatics/btz470
author Le, Trang T
Fu, Weixuan
Moore, Jason H
collection PubMed
description MOTIVATION: Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming (GP) to recommend an optimized analysis pipeline for the data scientist’s prediction problem. However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data. RESULTS: We introduce two new features implemented in TPOT that help increase the system’s scalability: Feature Set Selector (FSS) and Template. FSS provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT’s efficiency in application on big data by slicing the entire dataset into smaller sets of features and allowing GP to select the best subset in the final pipeline. Template enforces type constraints with strongly typed GP and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show TPOT-FSS significantly outperforms a tuned XGBoost model and standard TPOT implementation. We apply TPOT-FSS to real RNA-Seq data from a study of major depressive disorder. Independent of the previous study that identified significant association with depression severity of two modules, TPOT-FSS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual. AVAILABILITY AND IMPLEMENTATION: Detailed simulation and analysis code needed to reproduce the results in this study is available at https://github.com/lelaboratoire/tpot-fss. Implementation of the new TPOT operators is available at https://github.com/EpistasisLab/tpot.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-6956793
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6956793 2020-01-16 Scaling tree-based automated machine learning to biomedical big data with a feature set selector Le, Trang T Fu, Weixuan Moore, Jason H Bioinformatics Original Papers Oxford University Press 2020-01-01 2019-06-04 /pmc/articles/PMC6956793/ /pubmed/31165141 http://dx.doi.org/10.1093/bioinformatics/btz470 Text en © The Author(s) 2019. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Scaling tree-based automated machine learning to biomedical big data with a feature set selector
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6956793/
https://www.ncbi.nlm.nih.gov/pubmed/31165141
http://dx.doi.org/10.1093/bioinformatics/btz470