
GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture

In recent years, there has been a huge influx of genomic data and a growing need for its phenotypic correlations, yet existing genomic databases do not allow easy storage of and access to the combined phenotypic–genotypic information. Freely accessible allele frequency (AF) databases, such as gnomAD,...


Bibliographic Details
Main Authors: Hadar, Noam; Weintraub, Grisha; Gudes, Ehud; Dolev, Shlomi; Birk, Ohad S
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10263466/
https://www.ncbi.nlm.nih.gov/pubmed/37311148
http://dx.doi.org/10.1093/database/baad043
_version_ 1785058247682555904
author Hadar, Noam
Weintraub, Grisha
Gudes, Ehud
Dolev, Shlomi
Birk, Ohad S
author_facet Hadar, Noam
Weintraub, Grisha
Gudes, Ehud
Dolev, Shlomi
Birk, Ohad S
author_sort Hadar, Noam
collection PubMed
description In recent years, there has been a huge influx of genomic data and a growing need for its phenotypic correlations, yet existing genomic databases do not allow easy storage of and access to the combined phenotypic–genotypic information. Freely accessible allele frequency (AF) databases, such as gnomAD, are crucial for evaluating variants but lack correlated phenotype data. The Sequence Read Archive (SRA) accumulates hundreds of thousands of next-generation sequencing (NGS) samples tagged by their submitters with various attributes. However, samples are stored in large raw-format files that are inaccessible to the typical user. To make thousands of NGS samples and their corresponding additional attributes easily available to clinicians and researchers, we generated a pipeline that continuously downloads raw human NGS data uploaded to SRA using SRAtoolkit and preprocesses them using the GATK pipeline. Data are then stored efficiently in a cloud data lake and can be accessed via a representational state transfer application programming interface (REST API) and a user-friendly website. We thus generated GeniePool, a simple and intuitive web service and API for querying NGS data from SRA with direct access to information related to each sample and related studies, providing significant advantages over existing databases for both clinical and research use. Utilizing data lake infrastructure, we were able to generate a multi-purpose tool that can serve many clinical and research use cases. We expect users to explore the metadata served via GeniePool both in daily clinical practice and in versatile research endeavours. Database URL: https://geniepool.link
format Online
Article
Text
id pubmed-10263466
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10263466 2023-06-15 GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture Hadar, Noam Weintraub, Grisha Gudes, Ehud Dolev, Shlomi Birk, Ohad S Database (Oxford) Original Article Oxford University Press 2023-06-13 /pmc/articles/PMC10263466/ /pubmed/37311148 http://dx.doi.org/10.1093/database/baad043 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Hadar, Noam
Weintraub, Grisha
Gudes, Ehud
Dolev, Shlomi
Birk, Ohad S
GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture
title GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture
title_full GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture
title_fullStr GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture
title_full_unstemmed GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture
title_short GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture
title_sort geniepool: genomic database with corresponding annotated samples based on a cloud data lake architecture
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10263466/
https://www.ncbi.nlm.nih.gov/pubmed/37311148
http://dx.doi.org/10.1093/database/baad043