High dimensional biological data retrieval optimization with NoSQL technology

BACKGROUND: High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases...

Bibliographic Details
Main authors:	Wang, Shicai, Pandis, Ioannis, Wu, Chao, He, Sijin, Johnson, David, Emam, Ibrahim, Guitton, Florian, Guo, Yike
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Proceedings
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4248814/ https://www.ncbi.nlm.nih.gov/pubmed/25435347 http://dx.doi.org/10.1186/1471-2164-15-S8-S3

author	Wang, Shicai Pandis, Ioannis Wu, Chao He, Sijin Johnson, David Emam, Ibrahim Guitton, Florian Guo, Yike
collection	PubMed
description	BACKGROUND: High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries against relational databases for hundreds of different patient gene expression records are slow. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. RESULTS: In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase in query performance compared to MongoDB. CONCLUSIONS: The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
format	Online Article Text
id	pubmed-4248814
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4248814 2014-12-04 BMC Genomics Proceedings BioMed Central 2014-11-13 /pmc/articles/PMC4248814/ /pubmed/25435347 http://dx.doi.org/10.1186/1471-2164-15-S8-S3 Text en Copyright © 2014 Wang et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	High dimensional biological data retrieval optimization with NoSQL technology
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4248814/ https://www.ncbi.nlm.nih.gov/pubmed/25435347 http://dx.doi.org/10.1186/1471-2164-15-S8-S3
