Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies
Modern genomic studies are rapidly growing in scale, and the analytical approaches used to analyze genomic data are increasing in complexity. Genomic data management poses logistic and computational challenges, and analyses are increasingly reliant on genomic annotation resources that create their o...
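Main authors: | WHEELER, NICHOLAS R., BENCHEK, PENELOPE, KUNKLE, BRIAN W., HAMILTON-NELSON, KARA L., WARFE, MIKE, FONDRAN, JEREMY R., HAINES, JONATHAN L., BUSH, WILLIAM S. |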
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6956992/ https://www.ncbi.nlm.nih.gov/pubmed/31797624 |
_version_ | 1783487241693167616 |
---|---|
author | WHEELER, NICHOLAS R. BENCHEK, PENELOPE KUNKLE, BRIAN W. HAMILTON-NELSON, KARA L. WARFE, MIKE FONDRAN, JEREMY R. HAINES, JONATHAN L. BUSH, WILLIAM S. |
author_facet | WHEELER, NICHOLAS R. BENCHEK, PENELOPE KUNKLE, BRIAN W. HAMILTON-NELSON, KARA L. WARFE, MIKE FONDRAN, JEREMY R. HAINES, JONATHAN L. BUSH, WILLIAM S. |
author_sort | WHEELER, NICHOLAS R. |
collection | PubMed |
description | Modern genomic studies are rapidly growing in scale, and the analytical approaches used to analyze genomic data are increasing in complexity. Genomic data management poses logistic and computational challenges, and analyses are increasingly reliant on genomic annotation resources that create their own data management and versioning issues. As a result, genomic datasets are increasingly handled in ways that limit the rigor and reproducibility of many analyses. In this work, we examine the use of the Spark infrastructure for the management, access, and analysis of genomic data in comparison to traditional genomic workflows on typical cluster environments. We validate the framework by reproducing previously published results from the Alzheimer’s Disease Sequencing Project. Using the framework and analyses designed using Jupyter notebooks, Spark provides improved workflows, reduces user-driven data partitioning, and enhances the portability and reproducibility of distributed analyses required for large-scale genomic studies. |
format | Online Article Text |
id | pubmed-6956992 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
record_format | MEDLINE/PubMed |
spelling | pubmed-69569922020-01-13 Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies WHEELER, NICHOLAS R. BENCHEK, PENELOPE KUNKLE, BRIAN W. HAMILTON-NELSON, KARA L. WARFE, MIKE FONDRAN, JEREMY R. HAINES, JONATHAN L. BUSH, WILLIAM S. Pac Symp Biocomput Article Modern genomic studies are rapidly growing in scale, and the analytical approaches used to analyze genomic data are increasing in complexity. Genomic data management poses logistic and computational challenges, and analyses are increasingly reliant on genomic annotation resources that create their own data management and versioning issues. As a result, genomic datasets are increasingly handled in ways that limit the rigor and reproducibility of many analyses. In this work, we examine the use of the Spark infrastructure for the management, access, and analysis of genomic data in comparison to traditional genomic workflows on typical cluster environments. We validate the framework by reproducing previously published results from the Alzheimer’s Disease Sequencing Project. Using the framework and analyses designed using Jupyter notebooks, Spark provides improved workflows, reduces user-driven data partitioning, and enhances the portability and reproducibility of distributed analyses required for large-scale genomic studies. 2020 /pmc/articles/PMC6956992/ /pubmed/31797624 Text en http://creativecommons.org/licenses/by-nc/4.0/ Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License. |
spellingShingle | Article WHEELER, NICHOLAS R. BENCHEK, PENELOPE KUNKLE, BRIAN W. HAMILTON-NELSON, KARA L. WARFE, MIKE FONDRAN, JEREMY R. HAINES, JONATHAN L. BUSH, WILLIAM S. Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies |
title | Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies |
title_full | Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies |
title_fullStr | Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies |
title_full_unstemmed | Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies |
title_short | Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies |
title_sort | hadoop and pyspark for reproducibility and scalability of genomic sequencing studies |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6956992/ https://www.ncbi.nlm.nih.gov/pubmed/31797624 |
work_keys_str_mv | AT wheelernicholasr hadoopandpysparkforreproducibilityandscalabilityofgenomicsequencingstudies AT benchekpenelope hadoopandpysparkforreproducibilityandscalabilityofgenomicsequencingstudies AT kunklebrianw hadoopandpysparkforreproducibilityandscalabilityofgenomicsequencingstudies AT hamiltonnelsonkaral hadoopandpysparkforreproducibilityandscalabilityofgenomicsequencingstudies AT warfemike hadoopandpysparkforreproducibilityandscalabilityofgenomicsequencingstudies AT fondranjeremyr hadoopandpysparkforreproducibilityandscalabilityofgenomicsequencingstudies AT hainesjonathanl hadoopandpysparkforreproducibilityandscalabilityofgenomicsequencingstudies AT bushwilliams hadoopandpysparkforreproducibilityandscalabilityofgenomicsequencingstudies |