Handling Big Data Scalability in Biological Domain Using Parallel and Distributed Processing: A Case of Three Biological Semantic Similarity Measures

In the field of biology, researchers need to compare genes or gene products using semantic similarity measures (SSM). Continuous data growth and diversity in data characteristics comprise what is called big data; current biological SSMs cannot handle big data. Therefore, these measures need the abil...

Bibliographic Details
Main authors:	Almasoud, Ameera M.; Al-Khalifa, Hend S.; Al-Salman, Abdulmalik S.
Format:	Online Article Text
Language:	English
Published:	Hindawi 2019
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6369486/ https://www.ncbi.nlm.nih.gov/pubmed/30809545 http://dx.doi.org/10.1155/2019/6750296

_version_	1783394201619136512
author	Almasoud, Ameera M.; Al-Khalifa, Hend S.; Al-Salman, Abdulmalik S.
author_sort	Almasoud, Ameera M.
collection	PubMed
description	In the field of biology, researchers need to compare genes or gene products using semantic similarity measures (SSMs). Continuous data growth and diversity in data characteristics comprise what is called big data; current biological SSMs cannot handle big data. Therefore, these measures need the ability to control the size of big data. We used parallel and distributed processing by splitting data into multiple partitions and applying SSMs to each partition; this approach helped manage big data scalability and computational problems. Our solution involves three steps: split gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined and assessed for performance in the first two steps. Three of the best SSMs in biology [Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA] were enhanced by introducing threaded parallel processing, which is used in the third step. Our results demonstrate that introducing threads in SSMs reduced the time of calculating semantic similarity between gene pairs and improved performance of the three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, combined with using split GO and data clustering algorithms to split input data based on their similarity, reduced the average time more than did the approach of equally dividing input data. Time reduction increased with an increasing number of splits. Time reduction percentage was 24.1%, 39.2%, and 66.6% for Threaded SSDD; 33.0%, 78.2%, and 93.1% for Threaded SORA in the case of 2, 3, and 4 slaves, respectively; and 92.04% for Threaded Resnik in the case of four slaves.
format	Online Article Text
id	pubmed-6369486
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Hindawi
record_format	MEDLINE/PubMed
spelling	pubmed-6369486 2019-02-26 Handling Big Data Scalability in Biological Domain Using Parallel and Distributed Processing: A Case of Three Biological Semantic Similarity Measures Almasoud, Ameera M. Al-Khalifa, Hend S. Al-Salman, Abdulmalik S. Biomed Res Int Research Article Hindawi 2019-01-27 /pmc/articles/PMC6369486/ /pubmed/30809545 http://dx.doi.org/10.1155/2019/6750296 Text en Copyright © 2019 Ameera M. Almasoud et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Handling Big Data Scalability in Biological Domain Using Parallel and Distributed Processing: A Case of Three Biological Semantic Similarity Measures
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6369486/ https://www.ncbi.nlm.nih.gov/pubmed/30809545 http://dx.doi.org/10.1155/2019/6750296
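
The description field above outlines splitting the input into partitions and scoring gene pairs with threaded SSMs. The following is a minimal sketch of that idea for Resnik similarity only, not the authors' implementation. The toy ontology, the term probabilities, and the helper names (`ancestors`, `resnik`, `split`, `threaded_scores`) are all hypothetical, and the split is a plain equal division rather than the similarity-based clustering the record describes. In standard CPython, threads do not speed up pure-Python CPU-bound work, so the sketch illustrates the structure rather than the reported time savings.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Hypothetical toy ontology (term -> parent terms); not real GO data.
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "a1": {"a"},
    "a2": {"a"},
    "b1": {"b"},
}

# Hypothetical annotation probabilities p(term); assumed values for illustration.
PROB = {"root": 1.0, "a": 0.5, "b": 0.5, "a1": 0.25, "a2": 0.25, "b1": 0.5}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resnik(t1, t2):
    """Resnik similarity: information content -log p(c) of the most
    informative common ancestor c of the two terms."""
    common = ancestors(t1) & ancestors(t2)
    return max(-math.log(PROB[c]) for c in common)


def score_partition(pairs):
    """Score every pair in one partition."""
    return [(t1, t2, resnik(t1, t2)) for t1, t2 in pairs]


def split(items, n):
    """Divide items into n roughly equal chunks (equal division only)."""
    return [items[i::n] for i in range(n)]


def threaded_scores(terms, n_workers=2):
    """Score all term pairs, one partition per worker thread."""
    pairs = list(combinations(terms, 2))
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(score_partition, split(pairs, n_workers))
    return {(t1, t2): s for chunk in results for t1, t2, s in chunk}


if __name__ == "__main__":
    for pair, score in threaded_scores(["a1", "a2", "b1"]).items():
        print(pair, round(score, 3))
```

Terms that share only the root score zero, because the root has probability 1 and therefore carries no information; siblings under `a` score the information content of `a`.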
