
Inferring feature importance with uncertainties with application to large genotype data

Estimating feature importance, which is the contribution of a feature to one or several predictions, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generatin...


Bibliographic Details
Main authors: Johnsen, Pål Vegard, Strümke, Inga, Langaas, Mette, DeWan, Andrew Thomas, Riemer-Sørensen, Signe
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10038287/
https://www.ncbi.nlm.nih.gov/pubmed/36917581
http://dx.doi.org/10.1371/journal.pcbi.1010963
_version_ 1784912047218098176
author Johnsen, Pål Vegard
Strümke, Inga
Langaas, Mette
DeWan, Andrew Thomas
Riemer-Sørensen, Signe
collection PubMed
description Estimating feature importance, which is the contribution of a feature to one or several predictions, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data-generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including the uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, Sub-SAGE has the advantage that it can be estimated without computationally expensive resampling. We argue that, for all model types, the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping, and we demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as on large genotype data, where it is used to infer feature importance with respect to obesity.
format Online
Article
Text
id pubmed-10038287
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10038287 2023-03-25 PLoS Comput Biol Research Article Public Library of Science 2023-03-14 /pmc/articles/PMC10038287/ /pubmed/36917581 http://dx.doi.org/10.1371/journal.pcbi.1010963 Text en © 2023 Johnsen et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Inferring feature importance with uncertainties with application to large genotype data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10038287/
https://www.ncbi.nlm.nih.gov/pubmed/36917581
http://dx.doi.org/10.1371/journal.pcbi.1010963