An Approach to Identifying and Quantifying Bias in Biomedical Data
Data biases are a known impediment to the development of trustworthy machine learning models and their application to many biomedical problems. When biased data is suspected, the assumption that the labeled data is representative of the population must be relaxed and methods that exploit typically...
Main authors: | De Paolis Kaluza, M. Clara; Jain, Shantanu; Radivojac, Predrag |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9782737/ https://www.ncbi.nlm.nih.gov/pubmed/36540987 |
_version_ | 1784857410705293312 |
---|---|
author | De Paolis Kaluza, M. Clara Jain, Shantanu Radivojac, Predrag |
collection | PubMed |
description | Data biases are a known impediment to the development of trustworthy machine learning models and their application to many biomedical problems. When biased data is suspected, the assumption that the labeled data is representative of the population must be relaxed and methods that exploit typically representative unlabeled data must be developed. To mitigate the adverse effects of unrepresentative data, we consider a binary semi-supervised setting and focus on identifying whether the labeled data is biased and to what extent. We assume that the class-conditional distributions were generated by a family of component distributions represented at different proportions in labeled and unlabeled data. We also assume that the training data can be transformed to and subsequently modeled by a nested mixture of multivariate Gaussian distributions. We then develop a multi-sample expectation-maximization algorithm that learns all individual and shared parameters of the model from the combined data. Using these parameters, we develop a statistical test for the presence of the general form of bias in labeled data and estimate the level of this bias by computing the distance between corresponding class-conditional distributions in labeled and unlabeled data. We first study the new methods on synthetic data to understand their behavior and then apply them to real-world biomedical data to provide evidence that the bias estimation procedure is both possible and effective. |
format | Online Article Text |
id | pubmed-9782737 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
record_format | MEDLINE/PubMed |
spelling | pubmed-9782737 2023-01-01 An Approach to Identifying and Quantifying Bias in Biomedical Data De Paolis Kaluza, M. Clara Jain, Shantanu Radivojac, Predrag Pac Symp Biocomput Article Data biases are a known impediment to the development of trustworthy machine learning models and their application to many biomedical problems. When biased data is suspected, the assumption that the labeled data is representative of the population must be relaxed and methods that exploit typically representative unlabeled data must be developed. To mitigate the adverse effects of unrepresentative data, we consider a binary semi-supervised setting and focus on identifying whether the labeled data is biased and to what extent. We assume that the class-conditional distributions were generated by a family of component distributions represented at different proportions in labeled and unlabeled data. We also assume that the training data can be transformed to and subsequently modeled by a nested mixture of multivariate Gaussian distributions. We then develop a multi-sample expectation-maximization algorithm that learns all individual and shared parameters of the model from the combined data. Using these parameters, we develop a statistical test for the presence of the general form of bias in labeled data and estimate the level of this bias by computing the distance between corresponding class-conditional distributions in labeled and unlabeled data. We first study the new methods on synthetic data to understand their behavior and then apply them to real-world biomedical data to provide evidence that the bias estimation procedure is both possible and effective. 2023 /pmc/articles/PMC9782737/ /pubmed/36540987 Text en https://creativecommons.org/licenses/by/4.0/ Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License. |
title | An Approach to Identifying and Quantifying Bias in Biomedical Data |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9782737/ https://www.ncbi.nlm.nih.gov/pubmed/36540987 |
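
The description above mentions a multi-sample expectation-maximization algorithm in which component distributions are shared between labeled and unlabeled data but appear at different proportions. The record does not give the algorithm itself, so the following is only a minimal sketch of that general idea under simplifying assumptions: a flat (not nested) Gaussian mixture, no class labels, no data transformation, and no statistical test. The names `multi_sample_em` and `gaussian_pdf`, the synthetic data, and the use of total variation distance between mixing weights as a crude bias proxy are all illustrative choices, not the authors' method.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density evaluated at each row of x."""
    d = x.shape[1]
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, (x - mean).T)      # whitened residuals, shape (d, n)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    log_pdf = -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(z**2, axis=0))
    return np.exp(log_pdf)

def multi_sample_em(samples, n_components, n_iter=200, seed=0, reg=1e-6):
    """EM for several samples that share Gaussian components (means and
    covariances) but each have their own mixing proportions."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack(samples)
    n, d = pooled.shape
    # Initialize shared parameters from the pooled data.
    means = pooled[rng.choice(n, size=n_components, replace=False)]
    base_cov = np.cov(pooled.T).reshape(d, d) + reg * np.eye(d)
    covs = np.array([base_cov] * n_components)
    weights = [np.full(n_components, 1.0 / n_components) for _ in samples]

    for _ in range(n_iter):
        # E-step: responsibilities use each sample's own mixing weights.
        dens = np.column_stack(
            [gaussian_pdf(pooled, means[k], covs[k]) for k in range(n_components)]
        )
        dens = np.maximum(dens, 1e-300)          # guard against underflow
        resp, start = [], 0
        for s, x in enumerate(samples):
            r = weights[s] * dens[start:start + len(x)]
            resp.append(r / r.sum(axis=1, keepdims=True))
            start += len(x)

        # M-step: per-sample weights, shared means and covariances.
        weights = [r.mean(axis=0) for r in resp]
        r_all = np.vstack(resp)
        nk = r_all.sum(axis=0)
        means = (r_all.T @ pooled) / nk[:, None]
        for k in range(n_components):
            diff = pooled - means[k]
            covs[k] = (r_all[:, k, None] * diff).T @ diff / nk[k] + reg * np.eye(d)
    return weights, means, covs

# Synthetic illustration: two shared components, different proportions.
data_rng = np.random.default_rng(1)
labeled = np.vstack([data_rng.normal(0.0, 1.0, (400, 2)),
                     data_rng.normal(6.0, 1.0, (100, 2))])
unlabeled = np.vstack([data_rng.normal(0.0, 1.0, (300, 2)),
                       data_rng.normal(6.0, 1.0, (700, 2))])

(w_lab, w_unl), means, covs = multi_sample_em([labeled, unlabeled], n_components=2)

# Crude proxy for bias: total variation distance between mixing proportions.
tv_distance = 0.5 * np.abs(w_lab - w_unl).sum()
print("labeled weights:", w_lab)
print("unlabeled weights:", w_unl)
print("total variation distance:", tv_distance)
```

In this toy setup the labeled sample over-represents one component relative to the unlabeled sample, so the estimated proportions should differ noticeably. The approach described in the record goes further by nesting the mixture within classes and by testing for and measuring bias between class-conditional distributions, which this sketch does not attempt.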