Data and its (dis)contents: A survey of dataset development and use in machine learning research

In this work, we survey a breadth of literature that has revealed the limitations of predominant practices for dataset collection and use in the field of machine learning. We cover studies that critically review the design and development of datasets with a focus on negative societal impacts and poo...

Bibliographic Details
Main Authors: Paullada, Amandalynne; Raji, Inioluwa Deborah; Bender, Emily M.; Denton, Emily; Hanna, Alex
Format: Online Article Text
Language: English
Published: Elsevier 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8600147/
https://www.ncbi.nlm.nih.gov/pubmed/34820643
http://dx.doi.org/10.1016/j.patter.2021.100336
_version_ 1784601089108082688
author Paullada, Amandalynne
Raji, Inioluwa Deborah
Bender, Emily M.
Denton, Emily
Hanna, Alex
author_facet Paullada, Amandalynne
Raji, Inioluwa Deborah
Bender, Emily M.
Denton, Emily
Hanna, Alex
author_sort Paullada, Amandalynne
collection PubMed
description In this work, we survey a breadth of literature that has revealed the limitations of predominant practices for dataset collection and use in the field of machine learning. We cover studies that critically review the design and development of datasets with a focus on negative societal impacts and poor outcomes for system performance. We also cover approaches to filtering and augmenting data and modeling techniques aimed at mitigating the impact of bias in datasets. Finally, we discuss works that have studied data practices, cultures, and disciplinary norms and discuss implications for the legal, ethical, and functional challenges the field continues to face. Based on these findings, we advocate for the use of both qualitative and quantitative approaches to more carefully document and analyze datasets during the creation and usage phases.
format Online
Article
Text
id pubmed-8600147
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-8600147 2021-11-23 Data and its (dis)contents: A survey of dataset development and use in machine learning research Paullada, Amandalynne Raji, Inioluwa Deborah Bender, Emily M. Denton, Emily Hanna, Alex Patterns (N Y) Review In this work, we survey a breadth of literature that has revealed the limitations of predominant practices for dataset collection and use in the field of machine learning. We cover studies that critically review the design and development of datasets with a focus on negative societal impacts and poor outcomes for system performance. We also cover approaches to filtering and augmenting data and modeling techniques aimed at mitigating the impact of bias in datasets. Finally, we discuss works that have studied data practices, cultures, and disciplinary norms and discuss implications for the legal, ethical, and functional challenges the field continues to face. Based on these findings, we advocate for the use of both qualitative and quantitative approaches to more carefully document and analyze datasets during the creation and usage phases. Elsevier 2021-11-12 /pmc/articles/PMC8600147/ /pubmed/34820643 http://dx.doi.org/10.1016/j.patter.2021.100336 Text en © 2021 The Authors https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Review
Paullada, Amandalynne
Raji, Inioluwa Deborah
Bender, Emily M.
Denton, Emily
Hanna, Alex
Data and its (dis)contents: A survey of dataset development and use in machine learning research
title Data and its (dis)contents: A survey of dataset development and use in machine learning research
title_full Data and its (dis)contents: A survey of dataset development and use in machine learning research
title_fullStr Data and its (dis)contents: A survey of dataset development and use in machine learning research
title_full_unstemmed Data and its (dis)contents: A survey of dataset development and use in machine learning research
title_short Data and its (dis)contents: A survey of dataset development and use in machine learning research
title_sort data and its (dis)contents: a survey of dataset development and use in machine learning research
topic Review
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8600147/
https://www.ncbi.nlm.nih.gov/pubmed/34820643
http://dx.doi.org/10.1016/j.patter.2021.100336
work_keys_str_mv AT paulladaamandalynne dataanditsdiscontentsasurveyofdatasetdevelopmentanduseinmachinelearningresearch
AT rajiinioluwadeborah dataanditsdiscontentsasurveyofdatasetdevelopmentanduseinmachinelearningresearch
AT benderemilym dataanditsdiscontentsasurveyofdatasetdevelopmentanduseinmachinelearningresearch
AT dentonemily dataanditsdiscontentsasurveyofdatasetdevelopmentanduseinmachinelearningresearch
AT hannaalex dataanditsdiscontentsasurveyofdatasetdevelopmentanduseinmachinelearningresearch