
Doppelgänger spotting in biomedical gene expression data

Doppelgänger effects (DEs) occur when samples exhibit chance similarities such that, when split across training and validation sets, they inflate the performance of the trained machine learning (ML) model. This inflationary effect creates misleading confidence in the deployability of the model. So far, however, th...


Bibliographic Details
Main Authors: Wang, Li Rong; Choy, Xin Yun; Goh, Wilson Wen Bin
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9382272/
https://www.ncbi.nlm.nih.gov/pubmed/35992056
http://dx.doi.org/10.1016/j.isci.2022.104788
_version_ 1784769252016783360
author Wang, Li Rong
Choy, Xin Yun
Goh, Wilson Wen Bin
author_facet Wang, Li Rong
Choy, Xin Yun
Goh, Wilson Wen Bin
author_sort Wang, Li Rong
collection PubMed
description Doppelgänger effects (DEs) occur when samples exhibit chance similarities such that, when split across training and validation sets, they inflate the performance of the trained machine learning (ML) model. This inflationary effect creates misleading confidence in the deployability of the model. So far, however, there are no tools for doppelgänger identification or standard practices for managing their confounding implications. We present doppelgangerIdentifier, a software suite for doppelgänger identification and verification. Applying doppelgangerIdentifier across a multitude of diseases and data types, we show the pervasive nature of DEs in biomedical gene expression data. We also provide guidelines toward proper doppelgänger identification by exploring how lingering batch effects from batch imbalances affect the sensitivity of our doppelgänger identification algorithm. We suggest doppelgänger verification as a useful procedure for establishing baselines for model evaluation, which may indicate whether feature selection and ML on the data set are likely to yield meaningful insights.
format Online
Article
Text
id pubmed-9382272
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-9382272 2022-08-18 Doppelgänger spotting in biomedical gene expression data Wang, Li Rong Choy, Xin Yun Goh, Wilson Wen Bin iScience Article Doppelgänger effects (DEs) occur when samples exhibit chance similarities such that, when split across training and validation sets, they inflate the performance of the trained machine learning (ML) model. This inflationary effect creates misleading confidence in the deployability of the model. So far, however, there are no tools for doppelgänger identification or standard practices for managing their confounding implications. We present doppelgangerIdentifier, a software suite for doppelgänger identification and verification. Applying doppelgangerIdentifier across a multitude of diseases and data types, we show the pervasive nature of DEs in biomedical gene expression data. We also provide guidelines toward proper doppelgänger identification by exploring how lingering batch effects from batch imbalances affect the sensitivity of our doppelgänger identification algorithm. We suggest doppelgänger verification as a useful procedure for establishing baselines for model evaluation, which may indicate whether feature selection and ML on the data set are likely to yield meaningful insights. Elsevier 2022-07-19 /pmc/articles/PMC9382272/ /pubmed/35992056 http://dx.doi.org/10.1016/j.isci.2022.104788 Text en © 2022 The Author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
spellingShingle Article
Wang, Li Rong
Choy, Xin Yun
Goh, Wilson Wen Bin
Doppelgänger spotting in biomedical gene expression data
title Doppelgänger spotting in biomedical gene expression data
title_full Doppelgänger spotting in biomedical gene expression data
title_fullStr Doppelgänger spotting in biomedical gene expression data
title_full_unstemmed Doppelgänger spotting in biomedical gene expression data
title_short Doppelgänger spotting in biomedical gene expression data
title_sort doppelgänger spotting in biomedical gene expression data
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9382272/
https://www.ncbi.nlm.nih.gov/pubmed/35992056
http://dx.doi.org/10.1016/j.isci.2022.104788
work_keys_str_mv AT wanglirong doppelgangerspottinginbiomedicalgeneexpressiondata
AT choyxinyun doppelgangerspottinginbiomedicalgeneexpressiondata
AT gohwilsonwenbin doppelgangerspottinginbiomedicalgeneexpressiondata