Inferring feature importance with uncertainties with application to large genotype data
Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generatin...
Main Authors: | Johnsen, Pål Vegard; Strümke, Inga; Langaas, Mette; DeWan, Andrew Thomas; Riemer-Sørensen, Signe |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science, 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10038287/ https://www.ncbi.nlm.nih.gov/pubmed/36917581 http://dx.doi.org/10.1371/journal.pcbi.1010963 |
_version_ | 1784912047218098176 |
---|---|
author | Johnsen, Pål Vegard Strümke, Inga Langaas, Mette DeWan, Andrew Thomas Riemer-Sørensen, Signe |
author_sort | Johnsen, Pål Vegard |
collection | PubMed |
description | Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score of SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, it has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as large genotype data for predicting feature importance with respect to obesity. |
format | Online Article Text |
id | pubmed-10038287 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-10038287 2023-03-25 Inferring feature importance with uncertainties with application to large genotype data Johnsen, Pål Vegard Strümke, Inga Langaas, Mette DeWan, Andrew Thomas Riemer-Sørensen, Signe PLoS Comput Biol Research Article Public Library of Science 2023-03-14 /pmc/articles/PMC10038287/ /pubmed/36917581 http://dx.doi.org/10.1371/journal.pcbi.1010963 Text en © 2023 Johnsen et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
title | Inferring feature importance with uncertainties with application to large genotype data |
topic | Research Article |