Knowledge and Theme Discovery across Very Large Biological Data Sets Using Distributed Queries: A Prototype Combining Unstructured and Structured Data
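Main authors: | Mudunuri, Uma S.; Khouja, Mohamad; Repetski, Stephen; Venkataraman, Girish; Che, Anney; Luke, Brian T.; Girard, F. Pascal; Stephens, Robert M. |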
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2013 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3846626/ https://www.ncbi.nlm.nih.gov/pubmed/24312478 http://dx.doi.org/10.1371/journal.pone.0080503 |
_version_ | 1782293457890443264 |
---|---|
author | Mudunuri, Uma S. Khouja, Mohamad Repetski, Stephen Venkataraman, Girish Che, Anney Luke, Brian T. Girard, F. Pascal Stephens, Robert M. |
author_sort | Mudunuri, Uma S. |
collection | PubMed |
description | As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework. |
format | Online Article Text |
id | pubmed-3846626 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-3846626 2013-12-05 PLoS One Research Article Public Library of Science 2013-12-02 /pmc/articles/PMC3846626/ /pubmed/24312478 http://dx.doi.org/10.1371/journal.pone.0080503 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration, which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. |
title | Knowledge and Theme Discovery across Very Large Biological Data Sets Using Distributed Queries: A Prototype Combining Unstructured and Structured Data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3846626/ https://www.ncbi.nlm.nih.gov/pubmed/24312478 http://dx.doi.org/10.1371/journal.pone.0080503 |