An Approach to Identifying and Quantifying Bias in Biomedical Data

Data biases are a known impediment to the development of trustworthy machine learning models and their application to many biomedical problems. When biased data is suspected, the assumption that the labeled data is representative of the population must be relaxed and methods that exploit a typically...

Bibliographic Details
Main authors: De Paolis Kaluza, M. Clara, Jain, Shantanu, Radivojac, Predrag
Format: Online Article Text
Language: English
Published: 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9782737/
https://www.ncbi.nlm.nih.gov/pubmed/36540987
author De Paolis Kaluza, M. Clara
Jain, Shantanu
Radivojac, Predrag
collection PubMed
description Data biases are a known impediment to the development of trustworthy machine learning models and their application to many biomedical problems. When biased data is suspected, the assumption that the labeled data is representative of the population must be relaxed and methods that exploit a typically representative unlabeled data must be developed. To mitigate the adverse effects of unrepresentative data, we consider a binary semi-supervised setting and focus on identifying whether the labeled data is biased and to what extent. We assume that the class-conditional distributions were generated by a family of component distributions represented at different proportions in labeled and unlabeled data. We also assume that the training data can be transformed to and subsequently modeled by a nested mixture of multivariate Gaussian distributions. We then develop a multi-sample expectation-maximization algorithm that learns all individual and shared parameters of the model from the combined data. Using these parameters, we develop a statistical test for the presence of the general form of bias in labeled data and estimate the level of this bias by computing the distance between corresponding class-conditional distributions in labeled and unlabeled data. We first study the new methods on synthetic data to understand their behavior and then apply them to real-world biomedical data to provide evidence that the bias estimation procedure is both possible and effective.
format Online
Article
Text
id pubmed-9782737
institution National Center for Biotechnology Information
language English
publishDate 2023
record_format MEDLINE/PubMed
spelling pubmed-9782737 2023-01-01 An Approach to Identifying and Quantifying Bias in Biomedical Data De Paolis Kaluza, M. Clara Jain, Shantanu Radivojac, Predrag Pac Symp Biocomput Article Data biases are a known impediment to the development of trustworthy machine learning models and their application to many biomedical problems. When biased data is suspected, the assumption that the labeled data is representative of the population must be relaxed and methods that exploit a typically representative unlabeled data must be developed. To mitigate the adverse effects of unrepresentative data, we consider a binary semi-supervised setting and focus on identifying whether the labeled data is biased and to what extent. We assume that the class-conditional distributions were generated by a family of component distributions represented at different proportions in labeled and unlabeled data. We also assume that the training data can be transformed to and subsequently modeled by a nested mixture of multivariate Gaussian distributions. We then develop a multi-sample expectation-maximization algorithm that learns all individual and shared parameters of the model from the combined data. Using these parameters, we develop a statistical test for the presence of the general form of bias in labeled data and estimate the level of this bias by computing the distance between corresponding class-conditional distributions in labeled and unlabeled data. We first study the new methods on synthetic data to understand their behavior and then apply them to real-world biomedical data to provide evidence that the bias estimation procedure is both possible and effective. 2023 /pmc/articles/PMC9782737/ /pubmed/36540987 Text en https://creativecommons.org/licenses/by/4.0/ Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License.
title An Approach to Identifying and Quantifying Bias in Biomedical Data
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9782737/
https://www.ncbi.nlm.nih.gov/pubmed/36540987