
Knowledge and Theme Discovery across Very Large Biological Data Sets Using Distributed Queries: A Prototype Combining Unstructured and Structured Data


Bibliographic Details
Main authors: Mudunuri, Uma S., Khouja, Mohamad, Repetski, Stephen, Venkataraman, Girish, Che, Anney, Luke, Brian T., Girard, F. Pascal, Stephens, Robert M.
Format: Online Article Text
Language: English
Published: Public Library of Science 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3846626/
https://www.ncbi.nlm.nih.gov/pubmed/24312478
http://dx.doi.org/10.1371/journal.pone.0080503
_version_ 1782293457890443264
author Mudunuri, Uma S.
Khouja, Mohamad
Repetski, Stephen
Venkataraman, Girish
Che, Anney
Luke, Brian T.
Girard, F. Pascal
Stephens, Robert M.
author_sort Mudunuri, Uma S.
collection PubMed
description As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework.
format Online
Article
Text
id pubmed-3846626
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3846626 2013-12-05 PLoS One Research Article
Public Library of Science 2013-12-02 /pmc/articles/PMC3846626/ /pubmed/24312478 http://dx.doi.org/10.1371/journal.pone.0080503 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration, which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.
title Knowledge and Theme Discovery across Very Large Biological Data Sets Using Distributed Queries: A Prototype Combining Unstructured and Structured Data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3846626/
https://www.ncbi.nlm.nih.gov/pubmed/24312478
http://dx.doi.org/10.1371/journal.pone.0080503