
Benchmarking distributed data warehouse solutions for storing genomic variant information

Genomic-based personalized medicine encompasses storing, analysing and interpreting genomic variants as its central issues. At a time when thousands of patients' sequenced exomes and genomes are becoming available, there is a growing need for efficient database storage and querying. The answer could...


Bibliographic Details
Main authors: Wiewiórka, Marek S., Wysakowicz, Dawid P., Okoniewski, Michał J., Gambin, Tomasz
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5504537/
https://www.ncbi.nlm.nih.gov/pubmed/29220442
http://dx.doi.org/10.1093/database/bax049
author Wiewiórka, Marek S.
Wysakowicz, Dawid P.
Okoniewski, Michał J.
Gambin, Tomasz
collection PubMed
description Genomic-based personalized medicine encompasses storing, analysing and interpreting genomic variants as its central issues. At a time when thousands of patients' sequenced exomes and genomes are becoming available, there is a growing need for efficient database storage and querying. The answer could be the application of modern distributed storage systems and query engines. However, the application of large genomic variant databases to this problem has not been sufficiently explored in the literature. To investigate the effectiveness of modern columnar storage [column-oriented Database Management System (DBMS)] and query engines, we have developed a prototype genomic variant data warehouse, populated with a large volume of generated genomic variants and phenotypic data. Next, we have benchmarked the performance of a number of combinations of distributed storage systems and query engines on a set of SQL queries that address biological questions essential for both research and medical applications. In addition, a non-distributed analytical database (MonetDB) has been used as a baseline. Comparison of query execution times confirms that distributed data warehousing solutions outperform classic relational DBMSs. Moreover, pre-aggregation and further denormalization of data, which reduce the number of distributed join operations, improve query performance by several orders of magnitude. Most distributed back-ends offer good performance for complex analytical queries, while the Optimized Row Columnar (ORC) format paired with Presto, and Parquet paired with Spark 2, provide, on average, the lowest execution times. Apache Kudu, on the other hand, is the only solution that guarantees sub-second performance for simple genome range queries returning a small subset of data, where a low-latency response is expected, while still offering decent performance for analytical queries.
In summary, research and clinical applications that require the storage and analysis of variants from thousands of samples can benefit from the scalability and performance of distributed data warehouse solutions. Database URL: https://github.com/ZSI-Bio/variantsdwh
format Online
Article
Text
id pubmed-5504537
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5504537 2017-07-12 Database (Oxford) Original Article Oxford University Press 2017-07-11 /pmc/articles/PMC5504537/ /pubmed/29220442 http://dx.doi.org/10.1093/database/bax049 Text en © The Author(s) 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Benchmarking distributed data warehouse solutions for storing genomic variant information
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5504537/
https://www.ncbi.nlm.nih.gov/pubmed/29220442
http://dx.doi.org/10.1093/database/bax049