
Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies

Modern genomic studies are rapidly growing in scale, and the analytical approaches used to analyze genomic data are increasing in complexity. Genomic data management poses logistic and computational challenges, and analyses are increasingly reliant on genomic annotation resources that create their o...


Bibliographic Details
Main authors: WHEELER, NICHOLAS R., BENCHEK, PENELOPE, KUNKLE, BRIAN W., HAMILTON-NELSON, KARA L., WARFE, MIKE, FONDRAN, JEREMY R., HAINES, JONATHAN L., BUSH, WILLIAM S.
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6956992/
https://www.ncbi.nlm.nih.gov/pubmed/31797624
author WHEELER, NICHOLAS R.
BENCHEK, PENELOPE
KUNKLE, BRIAN W.
HAMILTON-NELSON, KARA L.
WARFE, MIKE
FONDRAN, JEREMY R.
HAINES, JONATHAN L.
BUSH, WILLIAM S.
collection PubMed
description Modern genomic studies are rapidly growing in scale, and the analytical approaches used to analyze genomic data are increasing in complexity. Genomic data management poses logistic and computational challenges, and analyses are increasingly reliant on genomic annotation resources that create their own data management and versioning issues. As a result, genomic datasets are increasingly handled in ways that limit the rigor and reproducibility of many analyses. In this work, we examine the use of the Spark infrastructure for the management, access, and analysis of genomic data in comparison to traditional genomic workflows on typical cluster environments. We validate the framework by reproducing previously published results from the Alzheimer’s Disease Sequencing Project. Using the framework and analyses designed using Jupyter notebooks, Spark provides improved workflows, reduces user-driven data partitioning, and enhances the portability and reproducibility of distributed analyses required for large-scale genomic studies.
format Online
Article
Text
id pubmed-6956992
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-6956992 2020-01-13 Pac Symp Biocomput Article 2020 /pmc/articles/PMC6956992/ /pubmed/31797624 Text en http://creativecommons.org/licenses/by-nc/4.0/ Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License.
title Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6956992/
https://www.ncbi.nlm.nih.gov/pubmed/31797624