
Implicit data crimes: Machine learning bias arising from misuse of public data

Bibliographic Details
Main Authors: Shimron, Efrat; Tamir, Jonathan I.; Wang, Ke; Lustig, Michael
Format: Online Article Text
Language: English
Published: National Academy of Sciences, 2022
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9060447/
https://www.ncbi.nlm.nih.gov/pubmed/35312366
http://dx.doi.org/10.1073/pnas.2117203119
Collection: PubMed (National Center for Biotechnology Information; record pubmed-9060447)
Journal: Proc Natl Acad Sci U S A
Subject: Physical Sciences
Record Format: MEDLINE/PubMed
Publication Dates: 2022-03-21, 2022-03-29
Description: Although open databases are an important resource in the current deep learning (DL) era, they are sometimes used “off label”: Data published for one task are used to train algorithms for a different one. This work aims to highlight that this common practice may lead to biased, overly optimistic results. We demonstrate this phenomenon for inverse problem solvers and show how their biased performance stems from hidden data-processing pipelines. We describe two processing pipelines typical of open-access databases and study their effects on three well-established algorithms developed for MRI reconstruction: compressed sensing, dictionary learning, and DL. Our results demonstrate that all these algorithms yield systematically biased results when they are naively trained on seemingly appropriate data: The normalized rms error improves consistently with the extent of data processing, showing an artificial improvement of 25 to 48% in some cases. Because this phenomenon is not widely known, biased results sometimes are published as state of the art; we refer to that as implicit “data crimes.” This work hence aims to raise awareness regarding naive off-label usage of big data and reveal the vulnerability of modern inverse problem solvers to the resulting bias.
License: Copyright © 2022 the Author(s). Published by PNAS. This open access article is distributed under the Creative Commons Attribution License 4.0 (CC BY) (https://creativecommons.org/licenses/by/4.0/).