Data and its (dis)contents: A survey of dataset development and use in machine learning research
In this work, we survey a breadth of literature that has revealed the limitations of predominant practices for dataset collection and use in the field of machine learning. We cover studies that critically review the design and development of datasets with a focus on negative societal impacts and poo...
Main authors: | Paullada, Amandalynne; Raji, Inioluwa Deborah; Bender, Emily M.; Denton, Emily; Hanna, Alex |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2021 |
Subjects: | Review |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8600147/ https://www.ncbi.nlm.nih.gov/pubmed/34820643 http://dx.doi.org/10.1016/j.patter.2021.100336 |
_version_ | 1784601089108082688 |
---|---|
author | Paullada, Amandalynne Raji, Inioluwa Deborah Bender, Emily M. Denton, Emily Hanna, Alex |
author_facet | Paullada, Amandalynne Raji, Inioluwa Deborah Bender, Emily M. Denton, Emily Hanna, Alex |
author_sort | Paullada, Amandalynne |
collection | PubMed |
description | In this work, we survey a breadth of literature that has revealed the limitations of predominant practices for dataset collection and use in the field of machine learning. We cover studies that critically review the design and development of datasets with a focus on negative societal impacts and poor outcomes for system performance. We also cover approaches to filtering and augmenting data and modeling techniques aimed at mitigating the impact of bias in datasets. Finally, we discuss works that have studied data practices, cultures, and disciplinary norms and discuss implications for the legal, ethical, and functional challenges the field continues to face. Based on these findings, we advocate for the use of both qualitative and quantitative approaches to more carefully document and analyze datasets during the creation and usage phases. |
format | Online Article Text |
id | pubmed-8600147 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-8600147 2021-11-23 Data and its (dis)contents: A survey of dataset development and use in machine learning research Paullada, Amandalynne Raji, Inioluwa Deborah Bender, Emily M. Denton, Emily Hanna, Alex Patterns (N Y) Review In this work, we survey a breadth of literature that has revealed the limitations of predominant practices for dataset collection and use in the field of machine learning. We cover studies that critically review the design and development of datasets with a focus on negative societal impacts and poor outcomes for system performance. We also cover approaches to filtering and augmenting data and modeling techniques aimed at mitigating the impact of bias in datasets. Finally, we discuss works that have studied data practices, cultures, and disciplinary norms and discuss implications for the legal, ethical, and functional challenges the field continues to face. Based on these findings, we advocate for the use of both qualitative and quantitative approaches to more carefully document and analyze datasets during the creation and usage phases. Elsevier 2021-11-12 /pmc/articles/PMC8600147/ /pubmed/34820643 http://dx.doi.org/10.1016/j.patter.2021.100336 Text en © 2021 The Authors https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
title | Data and its (dis)contents: A survey of dataset development and use in machine learning research |
topic | Review |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8600147/ https://www.ncbi.nlm.nih.gov/pubmed/34820643 http://dx.doi.org/10.1016/j.patter.2021.100336 |