
Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning

BACKGROUND: For the recruitment and monitoring of subjects for therapy studies, it is important to predict whether subjects with mild cognitive impairment (MCI) will prospectively develop Alzheimer’s disease (AD). Machine learning (ML) is suitable to improve early AD prediction. The etiology of AD is heter...


Bibliographic Details
Main Authors:	Bloch, Louise; Friedrich, Christoph M.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8444618/ https://www.ncbi.nlm.nih.gov/pubmed/34526114 http://dx.doi.org/10.1186/s13195-021-00879-4

author	Bloch, Louise; Friedrich, Christoph M.
collection	PubMed
description	BACKGROUND: For the recruitment and monitoring of subjects for therapy studies, it is important to predict whether subjects with mild cognitive impairment (MCI) will prospectively develop Alzheimer’s disease (AD). Machine learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to high variability in disease patterns. Further variability originates from multicentric study designs, varying acquisition protocols, and errors in the preprocessing of magnetic resonance imaging (MRI) scans. The high variability makes it difficult to differentiate between signal and noise and may lead to overfitting. This article examines whether an automatic and fair data valuation method based on Shapley values can identify the most informative subjects to improve ML classification. METHODS: An ML workflow was developed and trained on a subset of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort. The validation was performed on an independent ADNI test set and on the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric MRI feature extraction, feature selection, sample selection using Data Shapley, random forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, as well as Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. RESULTS: The RF models, which excluded 134 of the 467 training subjects based on their RF Data Shapley values, outperformed the base models, which reached a mean accuracy of 62.64%, by 5.76% (3.61 percentage points) on the independent ADNI test set. The XGBoost base models reached a mean accuracy of 60.00% on the AIBL data set. The exclusion of those 133 subjects with the smallest RF Data Shapley values could improve the classification accuracy by 2.98% (1.79 percentage points). The cutoff values were calculated using an independent validation set. CONCLUSION: The Data Shapley method was able to improve the mean accuracies for the test sets. The most informative subjects were associated with the number of apolipoprotein E ε4 (ApoE ε4) alleles, cognitive test results, and volumetric MRI measurements.
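The sample-selection step named in the abstract (valuing each training subject with Data Shapley and excluding the lowest-valued ones) can be illustrated with a minimal sketch. This is not the authors' workflow: it assumes a plain Monte Carlo permutation estimate without truncation, swaps the random forest for a tiny nearest-centroid classifier so that it runs with the standard library only, and uses invented toy data. The names `utility`, `monte_carlo_data_shapley`, `train`, and `val` are hypothetical.

```python
import random
from collections import defaultdict


def utility(train, val):
    """Validation accuracy of a nearest-centroid classifier fitted on `train`."""
    if not train:
        # Assumption: a model without training data performs at chance level.
        return 1.0 / len({label for _, label in val})
    sums, counts = {}, defaultdict(int)
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        # Class whose centroid has the smallest squared Euclidean distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return sum(predict(x) == y for x, y in val) / len(val)


def monte_carlo_data_shapley(train, val, n_permutations=200, seed=0):
    """Estimate one Data Shapley value per training sample.

    For each random ordering, samples are added one at a time and each
    sample is credited with the change in validation accuracy it causes.
    """
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset, previous = [], utility([], val)
        for idx in order:
            subset.append(train[idx])
            current = utility(subset, val)
            values[idx] += (current - previous) / n_permutations
            previous = current
    return values


# Invented toy data: two clusters plus one deliberately mislabeled sample.
train = [
    ((0.0, 0.1), 0), ((0.2, 0.0), 0), ((0.1, 0.3), 0),
    ((1.0, 0.9), 1), ((0.9, 1.1), 1), ((1.1, 1.0), 1),
    ((1.0, 1.0), 0),  # mislabeled on purpose
]
val = [((0.1, 0.1), 0), ((0.0, 0.2), 0), ((0.9, 1.0), 1), ((1.0, 1.1), 1)]

values = monte_carlo_data_shapley(train, val)
ranked = sorted(range(len(train)), key=values.__getitem__)  # lowest value first
kept = [train[i] for i in ranked[1:]]  # drop the single lowest-valued sample
```

In this sketch the cutoff (dropping one sample) is fixed by hand; the abstract states that the study instead calculated its cutoff values on an independent validation set. Because each permutation's contributions telescope, the estimated values sum to the difference between the accuracy with all training samples and the chance-level accuracy.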
format	Online Article Text
id	pubmed-8444618
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8444618 2021-09-17 Alzheimers Res Ther Research BioMed Central 2021-09-15 /pmc/articles/PMC8444618/ /pubmed/34526114 http://dx.doi.org/10.1186/s13195-021-00879-4 Text en © The Author(s) 2021. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8444618/ https://www.ncbi.nlm.nih.gov/pubmed/34526114 http://dx.doi.org/10.1186/s13195-021-00879-4

Similar Items