Shared data science infrastructure for genomics data

BACKGROUND: Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like Boa(g) are needed to efficiently pr...

Bibliographic Details
Main Authors:	Bagheri, Hamid; Muppirala, Usha; Masonbrink, Rick E.; Severin, Andrew J.; Rajan, Hridesh
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6704658/ https://www.ncbi.nlm.nih.gov/pubmed/31438850 http://dx.doi.org/10.1186/s12859-019-2967-2

author	Bagheri, Hamid; Muppirala, Usha; Masonbrink, Rick E.; Severin, Andrew J.; Rajan, Hridesh
author_sort	Bagheri, Hamid
collection	PubMed
description	BACKGROUND: Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like Boa(g) are needed to efficiently process and parse data contained in large data repositories. The main features of Boa(g) are inspired by existing languages for data-intensive computing, and Boa(g) can easily integrate data from biological data repositories. RESULTS: As a proof of concept, Boa for genomics, Boa(g), has been implemented to analyze the metadata of RefSeq’s 153,848 annotation (GFF) and assembly (FASTA) files. Boa(g) provides a massive improvement over existing solutions like Python and MongoDB by utilizing a domain-specific language that runs on Hadoop infrastructure, giving a smaller storage footprint, good scaling, and fewer lines of code. We execute scripts through Boa(g) to answer questions about the genomes in RefSeq. We identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. Boa(g) databases provide a significant reduction in the storage required for the raw data and a significant speedup in querying large datasets, owing to the automated parallelization and distribution that the Hadoop infrastructure provides during computations. CONCLUSIONS: In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, Boa(g), provides researchers with a more accessible way to efficiently explore data in new ways. We demonstrate the potential of the domain-specific language Boa(g) using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how Boa(g) could be used with large biological datasets.
format	Online Article Text
id	pubmed-6704658
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6704658 2019-08-22 Shared data science infrastructure for genomics data. Bagheri, Hamid; Muppirala, Usha; Masonbrink, Rick E.; Severin, Andrew J.; Rajan, Hridesh. BMC Bioinformatics. Research Article. BioMed Central 2019-08-22 /pmc/articles/PMC6704658/ /pubmed/31438850 http://dx.doi.org/10.1186/s12859-019-2967-2 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Shared data science infrastructure for genomics data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6704658/ https://www.ncbi.nlm.nih.gov/pubmed/31438850 http://dx.doi.org/10.1186/s12859-019-2967-2
