
An Approach to Identifying and Quantifying Bias in Biomedical Data

Data biases are a known impediment to the development of trustworthy machine learning models and their application to many biomedical problems. When biased data is suspected, the assumption that the labeled data is representative of the population must be relaxed and methods that exploit a typically...


Bibliographic Details
Main Authors:	De Paolis Kaluza, M. Clara; Jain, Shantanu; Radivojac, Predrag
Format:	Online Article Text
Language:	English
Published:	2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9782737/ https://www.ncbi.nlm.nih.gov/pubmed/36540987

_version_	1784857410705293312
author	De Paolis Kaluza, M. Clara Jain, Shantanu Radivojac, Predrag
author_facet	De Paolis Kaluza, M. Clara Jain, Shantanu Radivojac, Predrag
author_sort	De Paolis Kaluza, M. Clara
collection	PubMed
description	Data biases are a known impediment to the development of trustworthy machine learning models and their application to many biomedical problems. When biased data is suspected, the assumption that the labeled data is representative of the population must be relaxed, and methods that exploit typically representative unlabeled data must be developed. To mitigate the adverse effects of unrepresentative data, we consider a binary semi-supervised setting and focus on identifying whether the labeled data is biased and to what extent. We assume that the class-conditional distributions were generated by a family of component distributions represented at different proportions in labeled and unlabeled data. We also assume that the training data can be transformed to and subsequently modeled by a nested mixture of multivariate Gaussian distributions. We then develop a multi-sample expectation-maximization algorithm that learns all individual and shared parameters of the model from the combined data. Using these parameters, we develop a statistical test for the presence of the general form of bias in labeled data and estimate the level of this bias by computing the distance between corresponding class-conditional distributions in labeled and unlabeled data. We first study the new methods on synthetic data to understand their behavior and then apply them to real-world biomedical data to provide evidence that the bias estimation procedure is both possible and effective.
format	Online Article Text
id	pubmed-9782737
institution	National Center for Biotechnology Information
language	English
publishDate	2023
record_format	MEDLINE/PubMed
spelling	pubmed-9782737 2023-01-01 An Approach to Identifying and Quantifying Bias in Biomedical Data De Paolis Kaluza, M. Clara Jain, Shantanu Radivojac, Predrag Pac Symp Biocomput Article 2023 /pmc/articles/PMC9782737/ /pubmed/36540987 Text en https://creativecommons.org/licenses/by/4.0/ Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License.
spellingShingle	Article De Paolis Kaluza, M. Clara Jain, Shantanu Radivojac, Predrag An Approach to Identifying and Quantifying Bias in Biomedical Data
title	An Approach to Identifying and Quantifying Bias in Biomedical Data
title_full	An Approach to Identifying and Quantifying Bias in Biomedical Data
title_fullStr	An Approach to Identifying and Quantifying Bias in Biomedical Data
title_full_unstemmed	An Approach to Identifying and Quantifying Bias in Biomedical Data
title_short	An Approach to Identifying and Quantifying Bias in Biomedical Data
title_sort	approach to identifying and quantifying bias in biomedical data
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9782737/ https://www.ncbi.nlm.nih.gov/pubmed/36540987
work_keys_str_mv	AT depaoliskaluzamclara anapproachtoidentifyingandquantifyingbiasinbiomedicaldata AT jainshantanu anapproachtoidentifyingandquantifyingbiasinbiomedicaldata AT radivojacpredrag anapproachtoidentifyingandquantifyingbiasinbiomedicaldata AT depaoliskaluzamclara approachtoidentifyingandquantifyingbiasinbiomedicaldata AT jainshantanu approachtoidentifyingandquantifyingbiasinbiomedicaldata AT radivojacpredrag approachtoidentifyingandquantifyingbiasinbiomedicaldata
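
The abstract describes fitting one family of Gaussian components to labeled and unlabeled data, with each sample using the components at its own proportions. The block below is a minimal sketch of that idea only, not the authors' algorithm: it is univariate, has no nested class structure, and every name in it (`multi_sample_em`, `weight_distance`, the synthetic data) is a hypothetical choice made for illustration. The distance on mixing weights is likewise a crude stand-in for the paper's distance between class-conditional distributions.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def multi_sample_em(samples, n_components, n_iter=100):
    """EM for 1-D Gaussian components shared across samples.

    Means and variances are shared by all samples; each sample keeps
    its own vector of mixing weights. Simplified illustration only.
    """
    pooled = np.concatenate(samples)
    # Deterministic initialization: spread the means over pooled quantiles.
    quantiles = (np.arange(n_components) + 0.5) / n_components
    means = np.quantile(pooled, quantiles)
    variances = np.full(n_components, pooled.var())
    weights = [np.full(n_components, 1.0 / n_components) for _ in samples]

    for _ in range(n_iter):
        # E-step: responsibilities use each sample's own weights.
        resp = []
        for x, w in zip(samples, weights):
            dens = w * normal_pdf(x[:, None], means, variances)
            resp.append(dens / dens.sum(axis=1, keepdims=True))
        # M-step: weights are updated per sample ...
        weights = [r.mean(axis=0) for r in resp]
        # ... while means and variances are updated from the pooled data.
        r_all = np.vstack(resp)
        nk = r_all.sum(axis=0)
        means = (r_all * pooled[:, None]).sum(axis=0) / nk
        variances = (r_all * (pooled[:, None] - means) ** 2).sum(axis=0) / nk
        variances = np.maximum(variances, 1e-6)  # guard against collapse

    order = np.argsort(means)
    return means[order], variances[order], [w[order] for w in weights]


def weight_distance(w_a, w_b):
    """Total variation distance between two mixing-weight vectors."""
    return 0.5 * np.abs(w_a - w_b).sum()


# Synthetic example: the labeled sample over-represents the first component.
rng = np.random.default_rng(0)
labeled = np.concatenate([rng.normal(-3.0, 1.0, 1600), rng.normal(3.0, 1.0, 400)])
unlabeled = np.concatenate([rng.normal(-3.0, 1.0, 1000), rng.normal(3.0, 1.0, 1000)])

means, variances, (w_labeled, w_unlabeled) = multi_sample_em([labeled, unlabeled], 2)
bias_proxy = weight_distance(w_labeled, w_unlabeled)
print("shared means:", means)
print("labeled weights:", w_labeled, "unlabeled weights:", w_unlabeled)
print("weight distance:", bias_proxy)
```

In this toy setup the two samples share the same components but mix them differently, so a weight distance well above zero suggests that the labeled sample is not representative of the unlabeled one. Deciding how large a distance counts as bias would require a proper statistical test such as the one the abstract mentions, which this sketch does not attempt.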
