
Accessible data curation and analytics for international-scale citizen science datasets

The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020....


Bibliographic Details
Main Authors: Murray, Benjamin; Kerfoot, Eric; Chen, Liyuan; Deng, Jie; Graham, Mark S.; Sudre, Carole H.; Molteni, Erika; Canas, Liane S.; Antonelli, Michela; Klaser, Kerstin; Visconti, Alessia; Hammers, Alexander; Chan, Andrew T.; Franks, Paul W.; Davies, Richard; Wolf, Jonathan; Spector, Tim D.; Steves, Claire J.; Modat, Marc; Ourselin, Sebastien
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2021
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8608807/
https://www.ncbi.nlm.nih.gov/pubmed/34811392
http://dx.doi.org/10.1038/s41597-021-01071-x
collection PubMed
description The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020. The success of the Covid Symptom Study creates significant technical challenges around effective data curation. The primary issue is scale. The size of the dataset means that it can no longer be readily processed using standard Python-based data analytics software such as Pandas on commodity hardware. Alternative technologies exist but carry a higher technical complexity and are less accessible to many researchers. We present ExeTera, a Python-based open source software package designed to provide Pandas-like data analytics on datasets that approach terabyte scales. We present its design and capabilities, and show how it is a critical component of a data curation pipeline that enables reproducible research across an international research group for the Covid Symptom Study.
format Online
Article
Text
id pubmed-8608807
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-8608807 2021-12-03 Sci Data Article
Nature Publishing Group UK 2021-11-22 /pmc/articles/PMC8608807/ /pubmed/34811392 http://dx.doi.org/10.1038/s41597-021-01071-x Text en
© The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title Accessible data curation and analytics for international-scale citizen science datasets
topic Article