Inferring feature importance with uncertainties with application to large genotype data

Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generatin...

Bibliographic Details
Main Authors:	Johnsen, Pål Vegard, Strümke, Inga, Langaas, Mette, DeWan, Andrew Thomas, Riemer-Sørensen, Signe
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10038287/ https://www.ncbi.nlm.nih.gov/pubmed/36917581 http://dx.doi.org/10.1371/journal.pcbi.1010963

_version_	1784912047218098176
author	Johnsen, Pål Vegard Strümke, Inga Langaas, Mette DeWan, Andrew Thomas Riemer-Sørensen, Signe
author_sort	Johnsen, Pål Vegard
collection	PubMed
description	Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score of SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, it has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as large genotype data for predicting feature importance with respect to obesity.
format	Online Article Text
id	pubmed-10038287
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-100382872023-03-25 Inferring feature importance with uncertainties with application to large genotype data Johnsen, Pål Vegard Strümke, Inga Langaas, Mette DeWan, Andrew Thomas Riemer-Sørensen, Signe PLoS Comput Biol Research Article Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score of SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, it has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as large genotype data for predicting feature importance with respect to obesity. Public Library of Science 2023-03-14 /pmc/articles/PMC10038287/ /pubmed/36917581 http://dx.doi.org/10.1371/journal.pcbi.1010963 Text en © 2023 Johnsen et al https://creativecommons.org/licenses/by/4.0/This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Inferring feature importance with uncertainties with application to large genotype data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10038287/ https://www.ncbi.nlm.nih.gov/pubmed/36917581 http://dx.doi.org/10.1371/journal.pcbi.1010963
