
TCGA Expedition: A Data Acquisition and Management System for TCGA Data

BACKGROUND: The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile at least 500 cases of 20 different tumor types using genomic platforms and to make these data, both raw and processed, available to all researchers. TCGA data are currently over 1.2 Petabyte in size a...


Bibliographic Details
Main Authors: Chandran, Uma R., Medvedeva, Olga P., Barmada, M. Michael, Blood, Philip D., Chakka, Anish, Luthra, Soumya, Ferreira, Antonio, Wong, Kim F., Lee, Adrian V., Zhang, Zhihui, Budden, Robert, Scott, J. Ray, Berndt, Annerose, Berg, Jeremy M., Jacobson, Rebecca S.
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5082933/
https://www.ncbi.nlm.nih.gov/pubmed/27788220
http://dx.doi.org/10.1371/journal.pone.0165395
author Chandran, Uma R.
Medvedeva, Olga P.
Barmada, M. Michael
Blood, Philip D.
Chakka, Anish
Luthra, Soumya
Ferreira, Antonio
Wong, Kim F.
Lee, Adrian V.
Zhang, Zhihui
Budden, Robert
Scott, J. Ray
Berndt, Annerose
Berg, Jeremy M.
Jacobson, Rebecca S.
collection PubMed
description BACKGROUND: The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile at least 500 cases of 20 different tumor types using genomic platforms and to make these data, both raw and processed, available to all researchers. TCGA data are currently over 1.2 petabytes in size and include whole genome sequence (WGS), whole exome sequence, methylation, RNA expression, proteomic, and clinical datasets. Publicly accessible TCGA data are released through public portals, but many challenges exist in navigating and using data obtained from these sites. We developed TCGA Expedition to support the research community focused on computational methods for cancer research. Data obtained, versioned, and archived using TCGA Expedition support command line access at high-performance computing facilities as well as some functionality with third-party tools. For a subset of TCGA data collected at the University of Pittsburgh, we also re-associate TCGA data with de-identified data from the electronic health records. Here we describe the software as well as the architecture of our repository, methods for loading TCGA data onto multiple platforms, and security and regulatory controls that conform to federal best practices.
RESULTS: TCGA Expedition software consists of a set of scripts written in Bash, Python, and Java that download, extract, harmonize, version, and store all TCGA data and metadata. The software generates a versioned, participant- and sample-centered, local TCGA data directory with metadata structures that directly reference the local data files as well as the original data files. The software supports flexible searches of the data via a web portal, user-centric data tracking tools, and data provenance tools. Using this software, we created a collaborative repository, the Pittsburgh Genome Resource Repository (PGRR), which enabled investigators at our institution to work with all TCGA data formats and to interrogate these data with analysis pipelines and associated tools. WGS data are especially challenging for individual investigators to use, due to issues with downloading, storage, and processing; having locally accessible WGS BAM files has proven invaluable.
CONCLUSION: Our open-source, freely available TCGA Expedition software can be used to create a local collaborative infrastructure for acquiring, managing, and analyzing TCGA data and other large public datasets.
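The abstract describes a versioned, participant- and sample-centered local directory whose metadata references both the local file and the original file. The sketch below is one way such a layout could look, not the TCGA Expedition implementation: the directory scheme, the `FileRecord` fields, the helper names, and the example URL are all hypothetical. The only outside fact it relies on is that a TCGA barcode is hyphen-delimited, with the first three fields identifying the participant and the fourth identifying the sample.

```python
from dataclasses import dataclass, asdict
from pathlib import PurePosixPath
from typing import Tuple


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical metadata entry linking a local copy to its original source."""
    participant: str
    sample: str
    data_type: str
    version: int
    local_path: str
    original_location: str


def split_barcode(barcode: str) -> Tuple[str, str]:
    """Return the (participant, sample) prefixes of a sample-level TCGA barcode."""
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    participant = "-".join(parts[:3])  # e.g. TCGA-02-0001
    sample = "-".join(parts[:4])       # e.g. TCGA-02-0001-01C
    return participant, sample


def make_record(root: str, barcode: str, data_type: str, version: int,
                filename: str, original_location: str) -> FileRecord:
    """Build an assumed root/participant/sample/data_type/vN/filename path and its record."""
    participant, sample = split_barcode(barcode)
    local = (PurePosixPath(root) / participant / sample
             / data_type / f"v{version}" / filename)
    return FileRecord(participant, sample, data_type, version,
                      str(local), original_location)


# Illustrative values only; the root, data type, and URL are placeholders.
record = make_record(
    root="/data/tcga",
    barcode="TCGA-02-0001-01C-01D-0182-01",
    data_type="rnaseq",
    version=2,
    filename="expression.txt",
    original_location="https://example.org/source/expression.txt",
)
print(asdict(record))
```

Keeping the version as its own path component means a newer download can sit beside an older one without overwriting it, while the record still points back to where the file originally came from.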
format Online
Article
Text
id pubmed-5082933
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5082933 2016-11-04 PLoS One Research Article Public Library of Science 2016-10-27 /pmc/articles/PMC5082933/ /pubmed/27788220 http://dx.doi.org/10.1371/journal.pone.0165395 Text en © 2016 Chandran et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title TCGA Expedition: A Data Acquisition and Management System for TCGA Data
topic Research Article