
Explaining multivariate molecular diagnostic tests via Shapley values

BACKGROUND: Machine learning (ML) can be an effective tool to extract information from attribute-rich molecular datasets for the generation of molecular diagnostic tests. However, the way in which the resulting scores or classifications are produced from the input data may not be transparent. Algori...


Bibliographic Details
Main authors: Roder, Joanna; Maguire, Laura; Georgantas, Robert; Roder, Heinrich
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8265031/
https://www.ncbi.nlm.nih.gov/pubmed/34238309
http://dx.doi.org/10.1186/s12911-021-01569-9
author Roder, Joanna
Maguire, Laura
Georgantas, Robert
Roder, Heinrich
author_sort Roder, Joanna
collection PubMed
description BACKGROUND: Machine learning (ML) can be an effective tool to extract information from attribute-rich molecular datasets for the generation of molecular diagnostic tests. However, the way in which the resulting scores or classifications are produced from the input data may not be transparent. Algorithmic explainability or interpretability has become a focus of ML research. Shapley values, first introduced in game theory, can provide explanations of the result generated from a specific set of input data by a complex ML algorithm. METHODS: For a multivariate molecular diagnostic test in clinical use (the VeriStrat® test), we calculate and discuss the interpretation of exact Shapley values. We also employ some standard approximation techniques for Shapley value computation (local interpretable model-agnostic explanation (LIME) and Shapley Additive Explanations (SHAP) based methods) and compare the results with exact Shapley values. RESULTS: Exact Shapley values calculated for data collected from a cohort of 256 patients showed that the relative importance of attributes for test classification varied by sample. While all eight features used in the VeriStrat® test contributed equally to classification for some samples, other samples showed more complex patterns of attribute importance for classification generation. Exact Shapley values and Shapley-based interaction metrics were able to provide interpretable classification explanations at the sample or patient level, while patient subgroups could be defined by comparing Shapley value profiles between patients. LIME and SHAP approximation approaches, even those seeking to include correlations between attributes, produced results that were quantitatively and, in some cases qualitatively, different from the exact Shapley values. 
CONCLUSIONS: Shapley values can be used to determine the relative importance of input attributes to the result generated by a multivariate molecular diagnostic test for an individual sample or patient. Patient subgroups defined by Shapley value profiles may motivate translational research. However, correlations inherent in molecular data and the typically small ML training sets available for molecular diagnostic test development may cause some approximation methods to produce approximate Shapley values that differ both qualitatively and quantitatively from exact Shapley values. Hence, caution is advised when using approximate methods to evaluate Shapley explanations of the results of molecular diagnostic tests. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-021-01569-9.
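The abstract above refers to exact Shapley values without showing how they are obtained. The following is a minimal sketch of exact Shapley value computation by full coalition enumeration, which stays tractable for a small attribute count such as the eight features mentioned (2^8 = 256 coalitions). Everything named here is an assumption made for illustration: `exact_shapley`, `toy_model`, `sample`, and `baseline` are hypothetical, the toy model is not the VeriStrat® test, and treating an absent attribute by substituting a fixed baseline value is only one common convention, not necessarily the one used in the article.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value, n_features):
    """Exact Shapley values for a coalition value function.

    `value` maps a frozenset of feature indices to a number.
    Every coalition is enumerated, so cost grows as 2**n_features.
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight: |S|! * (n - |S| - 1)! / n!
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model with an interaction between features 1 and 2.
def toy_model(x):
    return 2.0 * x[0] + x[1] * x[2]


sample = [1.0, 2.0, 3.0]    # the input being explained (illustrative)
baseline = [0.0, 0.0, 0.0]  # assumed stand-in for "absent" features


def value(coalition):
    # Features in the coalition keep the sample's value; the rest
    # are replaced by the baseline (an assumption, see the lead-in).
    mixed = [
        sample[j] if j in coalition else baseline[j]
        for j in range(len(sample))
    ]
    return toy_model(mixed)


phi = exact_shapley(value, len(sample))
```

In this sketch the values in `phi` sum to `toy_model(sample) - toy_model(baseline)`, and the two interacting features share their joint contribution equally. Approximation methods such as LIME or SHAP avoid the full enumeration, which is where the differences reported in the abstract can arise.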
format Online
Article
Text
id pubmed-8265031
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8265031 2021-07-08 Explaining multivariate molecular diagnostic tests via Shapley values Roder, Joanna; Maguire, Laura; Georgantas, Robert; Roder, Heinrich BMC Med Inform Decis Mak Research BioMed Central 2021-07-08 /pmc/articles/PMC8265031/ /pubmed/34238309 http://dx.doi.org/10.1186/s12911-021-01569-9 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. 
If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Explaining multivariate molecular diagnostic tests via Shapley values
topic Research