Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning

BACKGROUND: For the recruitment and monitoring of subjects for therapy studies, it is important to predict whether mild cognitive impaired (MCI) subjects will prospectively develop Alzheimer’s disease (AD). Machine learning (ML) is suitable to improve early AD prediction. The etiology of AD is heter...

Bibliographic Details
Main authors: Bloch, Louise; Friedrich, Christoph M.
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8444618/
https://www.ncbi.nlm.nih.gov/pubmed/34526114
http://dx.doi.org/10.1186/s13195-021-00879-4
_version_ 1784568534950477824
author Bloch, Louise
Friedrich, Christoph M.
collection PubMed
description BACKGROUND: For the recruitment and monitoring of subjects for therapy studies, it is important to predict whether subjects with mild cognitive impairment (MCI) will prospectively develop Alzheimer’s disease (AD). Machine learning (ML) is well suited to improving early AD prediction. The etiology of AD is heterogeneous, which leads to high variability in disease patterns. Further variability originates from multicentric study designs, varying acquisition protocols, and errors in the preprocessing of magnetic resonance imaging (MRI) scans. The high variability makes it difficult to differentiate between signal and noise and may lead to overfitting. This article examines whether an automatic and fair data valuation method based on Shapley values can identify the most informative subjects to improve ML classification. METHODS: An ML workflow was developed and trained on a subset of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort. Validation was performed on an independent ADNI test set and on the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric MRI feature extraction, feature selection, sample selection using Data Shapley, random forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. RESULTS: The RF models, which excluded 134 of the 467 training subjects based on their RF Data Shapley values, outperformed the base models, which reached a mean accuracy of 62.64%, by 5.76% (3.61 percentage points) on the independent ADNI test set. The XGBoost base models reached a mean accuracy of 60.00% on the AIBL data set. Excluding the 133 subjects with the smallest RF Data Shapley values improved the classification accuracy by 2.98% (1.79 percentage points). The cutoff values were calculated using an independent validation set. CONCLUSION: The Data Shapley method was able to improve the mean accuracies for the test sets. The most informative subjects were associated with the number of ApolipoproteinE ε4 (ApoE ε4) alleles, cognitive test results, and volumetric MRI measurements.
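The RESULTS figures quote each gain twice, once as a relative improvement and once in percentage points. The short Python sketch below only re-derives the relative figures from the base accuracies and percentage-point gains stated in the abstract. The function name `improved_accuracy` is hypothetical, and the improved mean accuracies it returns are derived values (base plus gain), assuming the gain is simply added to the base mean accuracy; they are not figures quoted in the record.

```python
def improved_accuracy(base_pct, points_gained):
    """Return (improved accuracy in %, relative gain in %).

    base_pct      -- base-model mean accuracy in percent (from the abstract)
    points_gained -- absolute gain in percentage points (from the abstract)
    """
    improved = base_pct + points_gained           # assumption: gain is additive
    relative = 100.0 * points_gained / base_pct   # gain relative to the base
    return improved, relative


# RF models on the independent ADNI test set: 62.64% base, +3.61 points.
adni_improved, adni_relative = improved_accuracy(62.64, 3.61)

# XGBoost models on the AIBL data set: 60.00% base, +1.79 points.
aibl_improved, aibl_relative = improved_accuracy(60.00, 1.79)

print(f"ADNI: {adni_relative:.2f}% relative gain")
print(f"AIBL: {aibl_relative:.2f}% relative gain")
```

Rounded to two decimals, the relative gains agree with the 5.76% and 2.98% given in the abstract, so the two ways of stating each improvement are consistent with each other.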
format Online
Article
Text
id pubmed-8444618
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-84446182021-09-17 Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning Bloch, Louise Friedrich, Christoph M. Alzheimers Res Ther Research BioMed Central 2021-09-15 /pmc/articles/PMC8444618/ /pubmed/34526114 http://dx.doi.org/10.1186/s13195-021-00879-4 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8444618/
https://www.ncbi.nlm.nih.gov/pubmed/34526114
http://dx.doi.org/10.1186/s13195-021-00879-4