Accessible data curation and analytics for international-scale citizen science datasets
The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020....
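Main authors: | Murray, Benjamin; Kerfoot, Eric; Chen, Liyuan; Deng, Jie; Graham, Mark S.; Sudre, Carole H.; Molteni, Erika; Canas, Liane S.; Antonelli, Michela; Klaser, Kerstin; Visconti, Alessia; Hammers, Alexander; Chan, Andrew T.; Franks, Paul W.; Davies, Richard; Wolf, Jonathan; Spector, Tim D.; Steves, Claire J.; Modat, Marc; Ourselin, Sebastien |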
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8608807/ https://www.ncbi.nlm.nih.gov/pubmed/34811392 http://dx.doi.org/10.1038/s41597-021-01071-x |
_version_ | 1784602808472829952 |
---|---|
author | Murray, Benjamin Kerfoot, Eric Chen, Liyuan Deng, Jie Graham, Mark S. Sudre, Carole H. Molteni, Erika Canas, Liane S. Antonelli, Michela Klaser, Kerstin Visconti, Alessia Hammers, Alexander Chan, Andrew T. Franks, Paul W. Davies, Richard Wolf, Jonathan Spector, Tim D. Steves, Claire J. Modat, Marc Ourselin, Sebastien |
author_facet | Murray, Benjamin Kerfoot, Eric Chen, Liyuan Deng, Jie Graham, Mark S. Sudre, Carole H. Molteni, Erika Canas, Liane S. Antonelli, Michela Klaser, Kerstin Visconti, Alessia Hammers, Alexander Chan, Andrew T. Franks, Paul W. Davies, Richard Wolf, Jonathan Spector, Tim D. Steves, Claire J. Modat, Marc Ourselin, Sebastien |
author_sort | Murray, Benjamin |
collection | PubMed |
description | The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020. The success of the Covid Symptom Study creates significant technical challenges around effective data curation. The primary issue is scale. The size of the dataset means that it can no longer be readily processed using standard Python-based data analytics software such as Pandas on commodity hardware. Alternative technologies exist but carry a higher technical complexity and are less accessible to many researchers. We present ExeTera, a Python-based open source software package designed to provide Pandas-like data analytics on datasets that approach terabyte scales. We present its design and capabilities, and show how it is a critical component of a data curation pipeline that enables reproducible research across an international research group for the Covid Symptom Study. |
format | Online Article Text |
id | pubmed-8608807 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-8608807 2021-12-03 Accessible data curation and analytics for international-scale citizen science datasets Murray, Benjamin Kerfoot, Eric Chen, Liyuan Deng, Jie Graham, Mark S. Sudre, Carole H. Molteni, Erika Canas, Liane S. Antonelli, Michela Klaser, Kerstin Visconti, Alessia Hammers, Alexander Chan, Andrew T. Franks, Paul W. Davies, Richard Wolf, Jonathan Spector, Tim D. Steves, Claire J. Modat, Marc Ourselin, Sebastien Sci Data Article The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020. The success of the Covid Symptom Study creates significant technical challenges around effective data curation. The primary issue is scale. The size of the dataset means that it can no longer be readily processed using standard Python-based data analytics software such as Pandas on commodity hardware. Alternative technologies exist but carry a higher technical complexity and are less accessible to many researchers. We present ExeTera, a Python-based open source software package designed to provide Pandas-like data analytics on datasets that approach terabyte scales. We present its design and capabilities, and show how it is a critical component of a data curation pipeline that enables reproducible research across an international research group for the Covid Symptom Study. Nature Publishing Group UK 2021-11-22 /pmc/articles/PMC8608807/ /pubmed/34811392 http://dx.doi.org/10.1038/s41597-021-01071-x Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article Murray, Benjamin Kerfoot, Eric Chen, Liyuan Deng, Jie Graham, Mark S. Sudre, Carole H. Molteni, Erika Canas, Liane S. Antonelli, Michela Klaser, Kerstin Visconti, Alessia Hammers, Alexander Chan, Andrew T. Franks, Paul W. Davies, Richard Wolf, Jonathan Spector, Tim D. Steves, Claire J. Modat, Marc Ourselin, Sebastien Accessible data curation and analytics for international-scale citizen science datasets |
title | Accessible data curation and analytics for international-scale citizen science datasets |
title_sort | accessible data curation and analytics for international-scale citizen science datasets |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8608807/ https://www.ncbi.nlm.nih.gov/pubmed/34811392 http://dx.doi.org/10.1038/s41597-021-01071-x |