Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study
We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. In this study, mortality models were built using gradient boosting decision trees...
Main Authors: | Madakkatel, Iqbal; Zhou, Ang; McDonnell, Mark D.; Hyppönen, Elina |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8626442/ https://www.ncbi.nlm.nih.gov/pubmed/34837000 http://dx.doi.org/10.1038/s41598-021-02476-9 |
_version_ | 1784606657612873728 |
---|---|
author | Madakkatel, Iqbal Zhou, Ang McDonnell, Mark D. Hyppönen, Elina |
author_sort | Madakkatel, Iqbal |
collection | PubMed |
description | We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. In this study, mortality models were built using gradient boosting decision trees (GBDT) and important predictors were identified using a Shapley values-based feature attribution method, SHAP values. Cox models controlled for false discovery rate were used for confounder adjustment, interpretability, and further validation. The pipeline was tested using information from 502,506 UK Biobank participants, aged 37–73 years at recruitment and followed over seven years for mortality registrations. From the 11,639 predictors included in GBDT, 193 potential risk factors had SHAP values ≥ 0.05, passed the correlation test, and were selected for further modelling. Of the summed variable importance, 60% was directly health-related, and baseline characteristics, sociodemographics, and lifestyle factors each contributed about 10%. Cox models adjusted for baseline characteristics showed evidence for an association with mortality for 166 of the 193 predictors. These included mostly well-known risk factors (e.g., age, sex, ethnicity, education, material deprivation, smoking, physical activity, self-rated health, BMI, and many disease outcomes). For 19 predictors we saw evidence for an association in the unadjusted but not the adjusted analyses, suggesting bias by confounding. Our GBDT-SHAP pipeline was able to identify relevant predictors ‘hidden’ within thousands of variables, providing an efficient and pragmatic solution for the first stage of hypothesis-free risk factor identification. |
format | Online Article Text |
id | pubmed-8626442 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-8626442 2021-11-29 Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study Madakkatel, Iqbal Zhou, Ang McDonnell, Mark D. Hyppönen, Elina Sci Rep Article Nature Publishing Group UK 2021-11-26 /pmc/articles/PMC8626442/ /pubmed/34837000 http://dx.doi.org/10.1038/s41598-021-02476-9 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
title | Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8626442/ https://www.ncbi.nlm.nih.gov/pubmed/34837000 http://dx.doi.org/10.1038/s41598-021-02476-9 |
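The description notes that the Cox models were "controlled for false discovery rate" but does not name the procedure. As a minimal sketch, assuming the widely used Benjamini-Hochberg step-up rule (an assumption, not confirmed by this record), the following shows how a set of per-predictor p-values could be screened. The function name `benjamini_hochberg`, the `alpha` level of 0.05, and the example p-values are all hypothetical and are not taken from the study.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag hypotheses rejected under the Benjamini-Hochberg step-up rule.

    Assumed procedure: the catalog record only says "false discovery rate".
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order, so ranks map back to inputs.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Step-up: every hypothesis ranked at or below max_rank is rejected.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for five candidate predictors (illustrative only).
example_p = [0.001, 0.008, 0.039, 0.041, 0.20]
flags = benjamini_hochberg(example_p, alpha=0.05)
print(flags)
```

In a pipeline like the one described, each p-value would come from a Cox model fitted for one SHAP-selected predictor, and only predictors flagged `True` would be reported as showing evidence for an association.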