Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset

OBJECTIVE: The lack of representative coronavirus disease 2019 (COVID-19) data is a bottleneck for reliable and generalizable machine learning. Data sharing is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases from data s...

Bibliographic Details
Main authors: Sáez, Carlos, Romero, Nekane, Conejero, J Alberto, García-Gómez, Juan M
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7797735/
https://www.ncbi.nlm.nih.gov/pubmed/33027509
http://dx.doi.org/10.1093/jamia/ocaa258
_version_ 1783634931637813248
author Sáez, Carlos
Romero, Nekane
Conejero, J Alberto
García-Gómez, Juan M
author_facet Sáez, Carlos
Romero, Nekane
Conejero, J Alberto
García-Gómez, Juan M
author_sort Sáez, Carlos
collection PubMed
description OBJECTIVE: The lack of representative coronavirus disease 2019 (COVID-19) data is a bottleneck for reliable and generalizable machine learning. Data sharing is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases from data source variability for COVID-19 machine learning. MATERIALS AND METHODS: We used the publicly available nCov2019 dataset, including patient-level data from several countries. We aimed to discover and classify severity subgroups using symptoms and comorbidities. RESULTS: Cases from the 2 countries with the highest prevalence were divided into separate subgroups with distinct severity manifestations. This variability can reduce the representativeness of training data with respect to the model target populations and increase model complexity at risk of overfitting. CONCLUSIONS: Data source variability is a potential contributor to bias in distributed research networks. We call for systematic assessment and reporting of data source variability and data quality in COVID-19 data sharing, as key information for reliable and generalizable machine learning.
format Online
Article
Text
id pubmed-7797735
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7797735 2021-01-12 Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset Sáez, Carlos Romero, Nekane Conejero, J Alberto García-Gómez, Juan M J Am Med Inform Assoc Brief Communications OBJECTIVE: The lack of representative coronavirus disease 2019 (COVID-19) data is a bottleneck for reliable and generalizable machine learning. Data sharing is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases from data source variability for COVID-19 machine learning. MATERIALS AND METHODS: We used the publicly available nCov2019 dataset, including patient-level data from several countries. We aimed to discover and classify severity subgroups using symptoms and comorbidities. RESULTS: Cases from the 2 countries with the highest prevalence were divided into separate subgroups with distinct severity manifestations. This variability can reduce the representativeness of training data with respect to the model target populations and increase model complexity at risk of overfitting. CONCLUSIONS: Data source variability is a potential contributor to bias in distributed research networks. We call for systematic assessment and reporting of data source variability and data quality in COVID-19 data sharing, as key information for reliable and generalizable machine learning. Oxford University Press 2020-10-07 /pmc/articles/PMC7797735/ /pubmed/33027509 http://dx.doi.org/10.1093/jamia/ocaa258 Text en © The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved.
For permissions, please email: journals.permissions@oup.com https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
spellingShingle Brief Communications
Sáez, Carlos
Romero, Nekane
Conejero, J Alberto
García-Gómez, Juan M
Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset
title Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset
title_full Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset
title_fullStr Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset
title_full_unstemmed Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset
title_short Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset
title_sort potential limitations in covid-19 machine learning due to data source variability: a case study in the ncov2019 dataset
topic Brief Communications
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7797735/
https://www.ncbi.nlm.nih.gov/pubmed/33027509
http://dx.doi.org/10.1093/jamia/ocaa258
work_keys_str_mv AT saezcarlos potentiallimitationsincovid19machinelearningduetodatasourcevariabilityacasestudyinthencov2019dataset
AT romeronekane potentiallimitationsincovid19machinelearningduetodatasourcevariabilityacasestudyinthencov2019dataset
AT conejerojalberto potentiallimitationsincovid19machinelearningduetodatasourcevariabilityacasestudyinthencov2019dataset
AT garciagomezjuanm potentiallimitationsincovid19machinelearningduetodatasourcevariabilityacasestudyinthencov2019dataset