Cargando…

NPARS—A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science

Background: Accuracy and reproducibility are vital in science and presents a significant challenge in the emerging discipline of data science, especially when the data are scientifically complex and massive in size. Further complicating matters, in the field of genomic-based science high-throughput...

Descripción completa

Detalles Bibliográficos
Autores principales: Ma, Li, Peterson, Erich A., Shin, Ik Jae, Muesse, Jason, Marino, Katy, Steliga, Matthew A., Johann, Donald J.
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Frontiers Media S.A. 2021
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8503682/
https://www.ncbi.nlm.nih.gov/pubmed/34647017
http://dx.doi.org/10.3389/fdata.2021.725095
_version_ 1784581180945858560
author Ma, Li
Peterson, Erich A.
Shin, Ik Jae
Muesse, Jason
Marino, Katy
Steliga, Matthew A.
Johann, Donald J.
author_facet Ma, Li
Peterson, Erich A.
Shin, Ik Jae
Muesse, Jason
Marino, Katy
Steliga, Matthew A.
Johann, Donald J.
author_sort Ma, Li
collection PubMed
description Background: Accuracy and reproducibility are vital in science and presents a significant challenge in the emerging discipline of data science, especially when the data are scientifically complex and massive in size. Further complicating matters, in the field of genomic-based science high-throughput sequencing technologies generate considerable amounts of data that needs to be stored, manipulated, and analyzed using a plethora of software tools. Researchers are rarely able to reproduce published genomic studies. Results: Presented is a novel approach which facilitates accuracy and reproducibility for large genomic research data sets. All data needed is loaded into a portable local database, which serves as an interface for well-known software frameworks. These include python-based Jupyter Notebooks and the use of RStudio projects and R markdown. All software is encapsulated using Docker containers and managed by Git, simplifying software configuration management. Conclusion: Accuracy and reproducibility in science is of a paramount importance. For the biomedical sciences, advances in high throughput technologies, molecular biology and quantitative methods are providing unprecedented insights into disease mechanisms. With these insights come the associated challenge of scientific data that is complex and massive in size. This makes collaboration, verification, validation, and reproducibility of findings difficult. To address these challenges the NGS post-pipeline accuracy and reproducibility system (NPARS) was developed. NPARS is a robust software infrastructure and methodology that can encapsulate data, code, and reporting for large genomic studies. This paper demonstrates the successful use of NPARS on large and complex genomic data sets across different computational platforms.
format Online
Article
Text
id pubmed-8503682
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-85036822021-10-12 NPARS—A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science Ma, Li Peterson, Erich A. Shin, Ik Jae Muesse, Jason Marino, Katy Steliga, Matthew A. Johann, Donald J. Front Big Data Big Data Background: Accuracy and reproducibility are vital in science and presents a significant challenge in the emerging discipline of data science, especially when the data are scientifically complex and massive in size. Further complicating matters, in the field of genomic-based science high-throughput sequencing technologies generate considerable amounts of data that needs to be stored, manipulated, and analyzed using a plethora of software tools. Researchers are rarely able to reproduce published genomic studies. Results: Presented is a novel approach which facilitates accuracy and reproducibility for large genomic research data sets. All data needed is loaded into a portable local database, which serves as an interface for well-known software frameworks. These include python-based Jupyter Notebooks and the use of RStudio projects and R markdown. All software is encapsulated using Docker containers and managed by Git, simplifying software configuration management. Conclusion: Accuracy and reproducibility in science is of a paramount importance. For the biomedical sciences, advances in high throughput technologies, molecular biology and quantitative methods are providing unprecedented insights into disease mechanisms. With these insights come the associated challenge of scientific data that is complex and massive in size. This makes collaboration, verification, validation, and reproducibility of findings difficult. To address these challenges the NGS post-pipeline accuracy and reproducibility system (NPARS) was developed. NPARS is a robust software infrastructure and methodology that can encapsulate data, code, and reporting for large genomic studies. This paper demonstrates the successful use of NPARS on large and complex genomic data sets across different computational platforms. Frontiers Media S.A. 2021-09-27 /pmc/articles/PMC8503682/ /pubmed/34647017 http://dx.doi.org/10.3389/fdata.2021.725095 Text en Copyright © 2021 Ma, Peterson, Shin, Muesse, Marino, Steliga and Johann. https://creativecommons.org/licenses/by/4.0/This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Big Data
Ma, Li
Peterson, Erich A.
Shin, Ik Jae
Muesse, Jason
Marino, Katy
Steliga, Matthew A.
Johann, Donald J.
NPARS—A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science
title NPARS—A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science
title_full NPARS—A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science
title_fullStr NPARS—A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science
title_full_unstemmed NPARS—A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science
title_short NPARS—A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science
title_sort npars—a novel approach to address accuracy and reproducibility in genomic data science
topic Big Data
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8503682/
https://www.ncbi.nlm.nih.gov/pubmed/34647017
http://dx.doi.org/10.3389/fdata.2021.725095
work_keys_str_mv AT mali nparsanovelapproachtoaddressaccuracyandreproducibilityingenomicdatascience
AT petersonericha nparsanovelapproachtoaddressaccuracyandreproducibilityingenomicdatascience
AT shinikjae nparsanovelapproachtoaddressaccuracyandreproducibilityingenomicdatascience
AT muessejason nparsanovelapproachtoaddressaccuracyandreproducibilityingenomicdatascience
AT marinokaty nparsanovelapproachtoaddressaccuracyandreproducibilityingenomicdatascience
AT steligamatthewa nparsanovelapproachtoaddressaccuracyandreproducibilityingenomicdatascience
AT johanndonaldj nparsanovelapproachtoaddressaccuracyandreproducibilityingenomicdatascience