GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture

In recent years, there has been a huge influx of genomic data and a growing need for its phenotypic correlations, yet existing genomic databases do not allow easy storage of and access to the combined phenotypic–genotypic information. Freely accessible allele frequency (AF) databases, such as gnomAD,...

Bibliographic Details
Main Authors:	Hadar, Noam; Weintraub, Grisha; Gudes, Ehud; Dolev, Shlomi; Birk, Ohad S
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Original Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10263466/ https://www.ncbi.nlm.nih.gov/pubmed/37311148 http://dx.doi.org/10.1093/database/baad043

author	Hadar, Noam; Weintraub, Grisha; Gudes, Ehud; Dolev, Shlomi; Birk, Ohad S
author_facet	Hadar, Noam; Weintraub, Grisha; Gudes, Ehud; Dolev, Shlomi; Birk, Ohad S
author_sort	Hadar, Noam
collection	PubMed
description	In recent years, there has been a huge influx of genomic data and a growing need for its phenotypic correlations, yet existing genomic databases do not allow easy storage of and access to the combined phenotypic–genotypic information. Freely accessible allele frequency (AF) databases, such as gnomAD, are crucial for evaluating variants but lack correlated phenotype data. The Sequence Read Archive (SRA) accumulates hundreds of thousands of next-generation sequencing (NGS) samples tagged with their submitters and various attributes. However, samples are stored in large raw-format files that are inaccessible to the typical user. To make thousands of NGS samples and their corresponding additional attributes easily available to clinicians and researchers, we built a pipeline that continuously downloads raw human NGS data uploaded to SRA using the SRA Toolkit and preprocesses them using a GATK pipeline. Data are then stored efficiently in a cloud data lake and can be accessed via a representational state transfer application programming interface (REST API) and a user-friendly website. We thus created GeniePool, a simple and intuitive web service and API for querying NGS data from SRA with direct access to information related to each sample and related studies, providing significant advantages over existing databases for both clinical and research use. Utilizing data lake infrastructure, we were able to build a multi-purpose tool that can serve many clinical and research use cases. We expect users to explore the metadata served via GeniePool both in daily clinical practice and in versatile research endeavours. Database URL: https://geniepool.link
format	Online Article Text
id	pubmed-10263466
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10263466 2023-06-15 GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture. Hadar, Noam; Weintraub, Grisha; Gudes, Ehud; Dolev, Shlomi; Birk, Ohad S. Database (Oxford). Original Article. Oxford University Press 2023-06-13 /pmc/articles/PMC10263466/ /pubmed/37311148 http://dx.doi.org/10.1093/database/baad043 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10263466/ https://www.ncbi.nlm.nih.gov/pubmed/37311148 http://dx.doi.org/10.1093/database/baad043