RES2/406: Making Complex Datasets Available over the Web

INTRODUCTION: The internet is the (current) ideal medium for sharing simple data, but the tools for describing complicated datasets, and the ethics and resulting technology for sharing confidential data, are less well understood. METHODS: I first describe a simple dataset we've put on the web -...

Bibliographic Details
Main author: Walker, N
Format: Text
Language: English
Published: Gunther Eysenbach 1999
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1761761/
http://dx.doi.org/10.2196/jmir.1.suppl1.e78
_version_ 1782131470925561856
author Walker, N
author_facet Walker, N
author_sort Walker, N
collection PubMed
description INTRODUCTION: The internet is the (current) ideal medium for sharing simple data, but the tools for describing complicated datasets, and the ethics and resulting technology for sharing confidential data, are less well understood. METHODS: I first describe a simple dataset we've put on the web - some of the world's first genome screen data. The data is anonymous; there was full subject consent; there is no foreseeable subject harm/benefit from data release; and the data sets are in a form readily understood by scientists working in the field. I then describe a large-scale longitudinal epidemiological study, and the tools used to make this comprehensible to secondary data users - the main innovation being a searchable data dictionary and interactive decision support for selecting data subsets from the multi-thousand-variable whole. Thirdly, I describe the current data access arrangements - "good enough" anonymity, and ftp access for signed-up collaborators. Lastly, I describe fully-functioning experimental alternatives: aggregated tables (generated with reference to the data dictionary) and raw data access for named collaborators via encryption, using the web's HTTPS protocol over Secure Sockets Layer (SSL). RESULTS: Datasets can be shared via the Web, however complex or confidential. For a simple (but important) dataset, see: http://www.mrc-bsu.cam.ac.uk/MSgenetics/ . For a complex dataset and support tools, see: http://www.mrc-bsu.cam.ac.uk/cfas/ or https://www.mrc-bsu.cam.ac.uk/cfas/ . The latter currently uses US-export (i.e. weak) levels of encryption. DISCUSSION: There is increasing pressure (from, for example, the Medical Research Council in the UK) to share data collected during publicly funded medical research. While the social sciences have shared data for many years via archive sites, "patient confidentiality" has prevented it in the medical world. Ironically, the increased use of biological samples - which require far greater stress on confidentiality and the anonymity of public records - has led to proposals for public databases of, and potential competition for, these scarce, expensive resources. In the social sciences, record anonymisation means stripping identifiers, but researchers also rely on the fierce legalese of "undertaking forms" to prevent subject identification. This model is breaking down with linked genotypic/phenotypic data - where it might become hugely financially worthwhile to identify a study subject. The data dictionary approach - adopted as an aid to understanding a large, complex dataset - can also be used to generate anonymised subsets of the data, and aggregated tables, live on the Web. However, full access will require the newer, secure web protocols - if we can find the political and financial will to buy them in from the States.
format Text
id pubmed-1761761
institution National Center for Biotechnology Information
language English
publishDate 1999
publisher Gunther Eysenbach
record_format MEDLINE/PubMed
spelling pubmed-1761761 2007-01-03 RES2/406: Making Complex Datasets Available over the Web Walker, N J Med Internet Res Abstract Gunther Eysenbach 1999-09-19 /pmc/articles/PMC1761761/ http://dx.doi.org/10.2196/jmir.1.suppl1.e78 Text en Except where otherwise noted, articles published in the Journal of Medical Internet Research are distributed under the terms of the Creative Commons Attribution License (http://www.creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Abstract
Walker, N
RES2/406: Making Complex Datasets Available over the Web
title RES2/406: Making Complex Datasets Available over the Web
title_full RES2/406: Making Complex Datasets Available over the Web
title_fullStr RES2/406: Making Complex Datasets Available over the Web
title_full_unstemmed RES2/406: Making Complex Datasets Available over the Web
title_short RES2/406: Making Complex Datasets Available over the Web
title_sort res2/406: making complex datasets available over the web
topic Abstract
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1761761/
http://dx.doi.org/10.2196/jmir.1.suppl1.e78
work_keys_str_mv AT walkern res2406makingcomplexdatasetsavailableovertheweb