Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning
BACKGROUND: For the recruitment and monitoring of subjects for therapy studies, it is important to predict whether mild cognitive impaired (MCI) subjects will prospectively develop Alzheimer’s disease (AD). Machine learning (ML) is suitable to improve early AD prediction. The etiology of AD is heter...
Main authors: | Bloch, Louise; Friedrich, Christoph M. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8444618/ https://www.ncbi.nlm.nih.gov/pubmed/34526114 http://dx.doi.org/10.1186/s13195-021-00879-4 |
_version_ | 1784568534950477824 |
---|---|
author | Bloch, Louise Friedrich, Christoph M. |
author_facet | Bloch, Louise Friedrich, Christoph M. |
author_sort | Bloch, Louise |
collection | PubMed |
description | BACKGROUND: For the recruitment and monitoring of subjects for therapy studies, it is important to predict whether mild cognitive impaired (MCI) subjects will prospectively develop Alzheimer’s disease (AD). Machine learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to high variability in disease patterns. Further variability originates from multicentric study designs, varying acquisition protocols, and errors in the preprocessing of magnetic resonance imaging (MRI) scans. The high variability makes the differentiation between signal and noise difficult and may lead to overfitting. This article examines whether an automatic and fair data valuation method based on Shapley values can identify the most informative subjects to improve ML classification. METHODS: An ML workflow was developed and trained for a subset of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric MRI feature extraction, feature selection, sample selection using Data Shapley, random forest (RF), and eXtreme Gradient Boosting (XGBoost) for model training as well as Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. RESULTS: The RF models, which excluded 134 of the 467 training subjects based on their RF Data Shapley values, outperformed the base models that reached a mean accuracy of 62.64% by 5.76% (3.61 percentage points) for the independent ADNI test set. The XGBoost base models reached a mean accuracy of 60.00% for the AIBL data set. The exclusion of those 133 subjects with the smallest RF Data Shapley values could improve the classification accuracy by 2.98% (1.79 percentage points). The cutoff values were calculated using an independent validation set. CONCLUSION: The Data Shapley method was able to improve the mean accuracies for the test sets. The most informative subjects were associated with the number of ApolipoproteinE ε4 (ApoE ε4) alleles, cognitive test results, and volumetric MRI measurements. |
format | Online Article Text |
id | pubmed-8444618 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-84446182021-09-17 Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning Bloch, Louise Friedrich, Christoph M. Alzheimers Res Ther Research BACKGROUND: For the recruitment and monitoring of subjects for therapy studies, it is important to predict whether mild cognitive impaired (MCI) subjects will prospectively develop Alzheimer’s disease (AD). Machine learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to high variability in disease patterns. Further variability originates from multicentric study designs, varying acquisition protocols, and errors in the preprocessing of magnetic resonance imaging (MRI) scans. The high variability makes the differentiation between signal and noise difficult and may lead to overfitting. This article examines whether an automatic and fair data valuation method based on Shapley values can identify the most informative subjects to improve ML classification. METHODS: An ML workflow was developed and trained for a subset of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric MRI feature extraction, feature selection, sample selection using Data Shapley, random forest (RF), and eXtreme Gradient Boosting (XGBoost) for model training as well as Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. RESULTS: The RF models, which excluded 134 of the 467 training subjects based on their RF Data Shapley values, outperformed the base models that reached a mean accuracy of 62.64% by 5.76% (3.61 percentage points) for the independent ADNI test set. The XGBoost base models reached a mean accuracy of 60.00% for the AIBL data set. The exclusion of those 133 subjects with the smallest RF Data Shapley values could improve the classification accuracy by 2.98% (1.79 percentage points). The cutoff values were calculated using an independent validation set. CONCLUSION: The Data Shapley method was able to improve the mean accuracies for the test sets. The most informative subjects were associated with the number of ApolipoproteinE ε4 (ApoE ε4) alleles, cognitive test results, and volumetric MRI measurements. BioMed Central 2021-09-15 /pmc/articles/PMC8444618/ /pubmed/34526114 http://dx.doi.org/10.1186/s13195-021-00879-4 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Bloch, Louise Friedrich, Christoph M. Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning |
title | Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning |
title_full | Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning |
title_fullStr | Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning |
title_full_unstemmed | Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning |
title_short | Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning |
title_sort | data analysis with shapley values for automatic subject selection in alzheimer’s disease data sets using interpretable machine learning |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8444618/ https://www.ncbi.nlm.nih.gov/pubmed/34526114 http://dx.doi.org/10.1186/s13195-021-00879-4 |
work_keys_str_mv | AT blochlouise dataanalysiswithshapleyvaluesforautomaticsubjectselectioninalzheimersdiseasedatasetsusinginterpretablemachinelearning AT friedrichchristophm dataanalysiswithshapleyvaluesforautomaticsubjectselectioninalzheimersdiseasedatasetsusinginterpretablemachinelearning AT dataanalysiswithshapleyvaluesforautomaticsubjectselectioninalzheimersdiseasedatasetsusinginterpretablemachinelearning |
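
The abstract in this record describes sample selection with Data Shapley: each training subject receives a value reflecting its average marginal contribution to validation performance, and the lowest-valued subjects are excluded before retraining. The following is a minimal sketch of that idea, not the authors' implementation. It makes several assumptions that are not in the record: a 1-nearest-neighbour classifier stands in for the random forest so the example needs only the standard library, the data are synthetic two-class points with some deliberately flipped labels, the empty training set is scored at chance level (0.5), and the number of excluded samples (`n_drop`) is a hypothetical fixed cutoff, whereas the record states that cutoffs were calculated using an independent validation set.

```python
import random


def knn_accuracy(train, val):
    """Accuracy of a 1-nearest-neighbour classifier (stand-in for the RF in the record)."""
    correct = 0
    for x, label in val:
        nearest = min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))
        correct += nearest[1] == label
    return correct / len(val)


def data_shapley(train, val, n_permutations=50, seed=0):
    """Monte Carlo estimate of one Data Shapley value per training sample."""
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    empty_score = 0.5  # assumption: chance level for a balanced two-class problem
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        prev = empty_score
        coalition = []
        for idx in order:
            # Add sample idx to the growing coalition and record its marginal gain.
            coalition.append(train[idx])
            score = knn_accuracy(coalition, val)
            values[idx] += score - prev
            prev = score
    return [v / n_permutations for v in values]


def make_data(n, rng, flip=0.0):
    """Synthetic 2D points from two Gaussian classes; `flip` is the share of wrong labels."""
    data = []
    for i in range(n):
        label = i % 2
        x = (rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0))
        if rng.random() < flip:
            label = 1 - label  # simulate a noisy or mislabeled subject
        data.append((x, label))
    return data


rng = random.Random(42)
train = make_data(30, rng, flip=0.2)  # noisy training set
val = make_data(40, rng)              # used only to compute the values
test = make_data(40, rng)             # held out for the final comparison

values = data_shapley(train, val)

n_drop = 8  # hypothetical cutoff, chosen here only for illustration
ranked = sorted(range(len(train)), key=lambda i: values[i])
reduced = [train[i] for i in ranked[n_drop:]]  # drop the lowest-valued samples

print("base accuracy:   ", knn_accuracy(train, test))
print("reduced accuracy:", knn_accuracy(reduced, test))
```

Because the marginal contributions within each permutation telescope, the estimated values sum to the full-training-set validation score minus the empty-set score, which is a quick sanity check on any implementation. Whether excluding low-valued samples helps on the test set depends on the data; the sketch prints both accuracies without assuming an outcome.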