
Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD

Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a...


Bibliographic Details
Main authors: Marcon, Yannick; Bishop, Tom; Avraam, Demetris; Escriba-Montagut, Xavier; Ryser-Welch, Patricia; Wheater, Stuart; Burton, Paul; González, Juan R.
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8034722/
https://www.ncbi.nlm.nih.gov/pubmed/33784300
http://dx.doi.org/10.1371/journal.pcbi.1008880
author Marcon, Yannick
Bishop, Tom
Avraam, Demetris
Escriba-Montagut, Xavier
Ryser-Welch, Patricia
Wheater, Stuart
Burton, Paul
González, Juan R.
collection PubMed
description Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a collection of R packages that reduce the challenges of these methods. These include ethico-legal constraints which limit researchers’ ability to physically bring data together and the analytical inflexibility associated with conventional approaches to sharing results. The key feature of DataSHIELD is that data from research studies stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access their data. The platform allows an analyst to pass commands to each server and the analyst receives results that do not disclose the individual-level data of any study participants. DataSHIELD uses Opal which is a data integration system used by epidemiological studies and developed by the OBiBa open source project in the domain of bioinformatics. However, until now the analysis of big data with DataSHIELD has been limited by the storage formats available in Opal and the analysis capabilities available in the DataSHIELD R packages. We present a new architecture (“resources”) for DataSHIELD and Opal to allow large, complex datasets to be used at their original location, in their original format and with external computing facilities. We provide some real big data analysis examples in genomics and geospatial projects. For genomic data analyses, we also illustrate how to extend the resources concept to address specific big data infrastructures such as GA4GH or EGA, and make use of shell commands. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects. 
To help researchers use this framework, we describe selected packages and present an online book (https://isglobal-brge.github.io/resource_bookdown).
format Online
Article
Text
id pubmed-8034722
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8034722 2021-04-15 Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD Marcon, Yannick; Bishop, Tom; Avraam, Demetris; Escriba-Montagut, Xavier; Ryser-Welch, Patricia; Wheater, Stuart; Burton, Paul; González, Juan R. PLoS Comput Biol Research Article Public Library of Science 2021-03-30 /pmc/articles/PMC8034722/ /pubmed/33784300 http://dx.doi.org/10.1371/journal.pcbi.1008880 Text en © 2021 Marcon et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8034722/
https://www.ncbi.nlm.nih.gov/pubmed/33784300
http://dx.doi.org/10.1371/journal.pcbi.1008880