
Co-Inference of Data Mislabelings Reveals Improved Models in Genomics and Breast Cancer Diagnostics

Mislabeling of cases as well as controls in case–control studies is a frequent source of strong bias in prognostic and diagnostic tests and algorithms. Common data processing methods available to the researchers in the biomedical community do not allow for consistent and robust treatment of labeled...


Bibliographic Details
Main authors:	Gerber, Susanne; Pospisil, Lukas; Sys, Stanislav; Hewel, Charlotte; Torkamani, Ali; Horenko, Illia
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2022
Subjects:	Artificial Intelligence
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8766632/ https://www.ncbi.nlm.nih.gov/pubmed/35072059 http://dx.doi.org/10.3389/frai.2021.739432

author	Gerber, Susanne; Pospisil, Lukas; Sys, Stanislav; Hewel, Charlotte; Torkamani, Ali; Horenko, Illia
collection	PubMed
description	Mislabeling of cases as well as controls in case–control studies is a frequent source of strong bias in prognostic and diagnostic tests and algorithms. Common data processing methods available to researchers in the biomedical community do not allow for consistent and robust treatment of labeled data in situations where both the case and the control groups contain a non-negligible proportion of mislabeled data instances. This is an especially prominent issue in studies of late-onset conditions, where individuals who may later convert to cases can populate the control group, and in screening studies, which often have high false-positive and false-negative rates. To address this problem, we propose a method for simultaneous robust inference of Lasso-reduced discriminative models and of latent group-specific mislabeling risks that does not require any exactly labeled data. We apply it to a standard breast cancer imaging dataset and infer the mislabeling probabilities (that is, the rates of false-negative and false-positive core-needle biopsies) together with a small set of simple diagnostic rules, outperforming the state-of-the-art BI-RADS diagnostics on these data. The inferred mislabeling rates for breast cancer biopsies agree with published, purely empirical studies. Applying the method to human genomic data from a healthy-ageing cohort reveals a previously unreported compact combination of single-nucleotide polymorphisms that is strongly associated with a healthy-ageing phenotype in Caucasians. It determines that 7.5% of Caucasians in the 1000 Genomes dataset (selected as a control group) carry a pattern characteristic of healthy ageing.
format	Online Article Text
id	pubmed-8766632
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8766632 2022-01-20 Front Artif Intell Frontiers Media S.A. 2022-01-05 /pmc/articles/PMC8766632/ /pubmed/35072059 http://dx.doi.org/10.3389/frai.2021.739432 Text en Copyright © 2022 Gerber, Pospisil, Sys, Hewel, Torkamani and Horenko. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Co-Inference of Data Mislabelings Reveals Improved Models in Genomics and Breast Cancer Diagnostics
topic	Artificial Intelligence
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8766632/ https://www.ncbi.nlm.nih.gov/pubmed/35072059 http://dx.doi.org/10.3389/frai.2021.739432
