GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture
In recent years, there has been a huge influx of genomic data and a growing need for its phenotypic correlations, yet existing genomic databases do not allow easy storage of and access to the combined phenotypic–genotypic information. Freely accessible allele frequency (AF) databases, such as gnomAD,...
Main authors: | Hadar, Noam; Weintraub, Grisha; Gudes, Ehud; Dolev, Shlomi; Birk, Ohad S |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2023 |
Subjects: | Original Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10263466/ https://www.ncbi.nlm.nih.gov/pubmed/37311148 http://dx.doi.org/10.1093/database/baad043 |
_version_ | 1785058247682555904 |
---|---|
author | Hadar, Noam Weintraub, Grisha Gudes, Ehud Dolev, Shlomi Birk, Ohad S |
author_facet | Hadar, Noam Weintraub, Grisha Gudes, Ehud Dolev, Shlomi Birk, Ohad S |
author_sort | Hadar, Noam |
collection | PubMed |
description | In recent years, there has been a huge influx of genomic data and a growing need for its phenotypic correlations, yet existing genomic databases do not allow easy storage of and access to the combined phenotypic–genotypic information. Freely accessible allele frequency (AF) databases, such as gnomAD, are crucial for evaluating variants but lack correlated phenotype data. The Sequence Read Archive (SRA) accumulates hundreds of thousands of next-generation sequencing (NGS) samples tagged by their submitters and various attributes. However, samples are stored in large raw-format files that are inaccessible to the typical user. To make thousands of NGS samples and their corresponding additional attributes easily available to clinicians and researchers, we generated a pipeline that continuously downloads raw human NGS data uploaded to SRA using SRAtoolkit and preprocesses them using the GATK pipeline. Data are then stored efficiently in a cloud data lake and can be accessed via a representational state transfer application programming interface (REST API) and a user-friendly website. We thus generated GeniePool, a simple and intuitive web service and API for querying NGS data from SRA with direct access to information related to each sample and related studies, providing significant advantages over existing databases for both clinical and research uses. Utilizing data lake infrastructure, we were able to generate a multi-purpose tool that can serve many clinical and research use cases. We expect users to explore the metadata served via GeniePool both in daily clinical practice and in versatile research endeavours. Database URL: https://geniepool.link |
format | Online Article Text |
id | pubmed-10263466 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10263466 2023-06-15 GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture Hadar, Noam Weintraub, Grisha Gudes, Ehud Dolev, Shlomi Birk, Ohad S Database (Oxford) Original Article In recent years, there has been a huge influx of genomic data and a growing need for its phenotypic correlations, yet existing genomic databases do not allow easy storage of and access to the combined phenotypic–genotypic information. Freely accessible allele frequency (AF) databases, such as gnomAD, are crucial for evaluating variants but lack correlated phenotype data. The Sequence Read Archive (SRA) accumulates hundreds of thousands of next-generation sequencing (NGS) samples tagged by their submitters and various attributes. However, samples are stored in large raw-format files that are inaccessible to the typical user. To make thousands of NGS samples and their corresponding additional attributes easily available to clinicians and researchers, we generated a pipeline that continuously downloads raw human NGS data uploaded to SRA using SRAtoolkit and preprocesses them using the GATK pipeline. Data are then stored efficiently in a cloud data lake and can be accessed via a representational state transfer application programming interface (REST API) and a user-friendly website. We thus generated GeniePool, a simple and intuitive web service and API for querying NGS data from SRA with direct access to information related to each sample and related studies, providing significant advantages over existing databases for both clinical and research uses. Utilizing data lake infrastructure, we were able to generate a multi-purpose tool that can serve many clinical and research use cases. We expect users to explore the metadata served via GeniePool both in daily clinical practice and in versatile research endeavours. Database URL: https://geniepool.link Oxford University Press 2023-06-13 /pmc/articles/PMC10263466/ /pubmed/37311148 http://dx.doi.org/10.1093/database/baad043 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Article Hadar, Noam Weintraub, Grisha Gudes, Ehud Dolev, Shlomi Birk, Ohad S GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture |
title | GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10263466/ https://www.ncbi.nlm.nih.gov/pubmed/37311148 http://dx.doi.org/10.1093/database/baad043 |