
An Efficient Parallelized Ontology Network-Based Semantic Similarity Measure for Big Biomedical Document Clustering

Semantic mining is always a challenge for big biomedical text data. Ontology has been widely proved and used to extract semantic information. However, the process of ontology-based semantic similarity calculation is so complex that it cannot measure the similarity for big text data. To solve this pr...


Bibliographic Details
Main authors: Li, Meijing; Chen, Tianjie; Ryu, Keun Ho; Jin, Cheng Hao
Format: Online Article Text
Language: English
Published: Hindawi 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8594978/
https://www.ncbi.nlm.nih.gov/pubmed/34795792
http://dx.doi.org/10.1155/2021/7937573
_version_ 1784600093065740288
author Li, Meijing
Chen, Tianjie
Ryu, Keun Ho
Jin, Cheng Hao
author_facet Li, Meijing
Chen, Tianjie
Ryu, Keun Ho
Jin, Cheng Hao
author_sort Li, Meijing
collection PubMed
description Semantic mining remains a challenge for big biomedical text data. Ontologies have been widely validated and used to extract semantic information. However, ontology-based semantic similarity calculation is so complex that it cannot measure similarity for big text data. To solve this problem, we propose a parallelized semantic similarity measurement method based on Hadoop MapReduce for big text data. First, we preprocess the documents and extract their semantic features. Then, we calculate document semantic similarity based on the ontology network structure under the MapReduce framework. Finally, document clusters are generated from the resulting document similarities via clustering algorithms. To validate the method's effectiveness, we use two kinds of open datasets. The experimental results show that the traditional methods can hardly work for more than ten thousand biomedical documents. The proposed method remains efficient and accurate for big datasets and offers high parallelism and scalability.
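The similarity step described above can be pictured with a small sketch. This is not the authors' implementation and does not use Hadoop: it emulates a map phase and a reduce phase in plain Python. The toy ontology, the path-based similarity `1 / (1 + distance)`, the averaging in the reducer, and all function names are assumptions made only for illustration.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy ontology as an undirected graph (not taken from the paper).
ONTOLOGY = {
    "disease": ["cancer", "infection"],
    "cancer": ["disease", "lung cancer"],
    "infection": ["disease"],
    "lung cancer": ["cancer"],
}


def path_length(a, b):
    """Return the number of edges on the shortest path between two concepts, or None."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in ONTOLOGY.get(node, []):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the concepts are not connected


def concept_similarity(a, b):
    """Assumed path-based measure: closer concepts in the network score higher."""
    d = path_length(a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)


def map_pair(pair):
    """Map phase: emit one (document pair, concept similarity) record per concept pair."""
    (id_a, concepts_a), (id_b, concepts_b) = pair
    for ca in concepts_a:
        for cb in concepts_b:
            yield (id_a, id_b), concept_similarity(ca, cb)


def reduce_pair(key, values):
    """Reduce phase: average the concept similarities emitted for one document pair."""
    values = list(values)
    return key, sum(values) / len(values)


def document_similarities(docs):
    """Run map, group by key (the shuffle step), then reduce, all in one process."""
    grouped = {}
    for pair in combinations(sorted(docs.items()), 2):
        for key, value in map_pair(pair):
            grouped.setdefault(key, []).append(value)
    return dict(reduce_pair(k, v) for k, v in grouped.items())
```

In a real MapReduce job the document pairs would be split across workers and the grouping would be done by the framework's shuffle; here a dictionary stands in for that step. The resulting pairwise similarities are the kind of input a clustering algorithm would consume in the final stage.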
format Online
Article
Text
id pubmed-8594978
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-85949782021-11-17 An Efficient Parallelized Ontology Network-Based Semantic Similarity Measure for Big Biomedical Document Clustering Li, Meijing Chen, Tianjie Ryu, Keun Ho Jin, Cheng Hao Comput Math Methods Med Research Article Semantic mining remains a challenge for big biomedical text data. Ontologies have been widely validated and used to extract semantic information. However, ontology-based semantic similarity calculation is so complex that it cannot measure similarity for big text data. To solve this problem, we propose a parallelized semantic similarity measurement method based on Hadoop MapReduce for big text data. First, we preprocess the documents and extract their semantic features. Then, we calculate document semantic similarity based on the ontology network structure under the MapReduce framework. Finally, document clusters are generated from the resulting document similarities via clustering algorithms. To validate the method's effectiveness, we use two kinds of open datasets. The experimental results show that the traditional methods can hardly work for more than ten thousand biomedical documents. The proposed method remains efficient and accurate for big datasets and offers high parallelism and scalability. Hindawi 2021-11-09 /pmc/articles/PMC8594978/ /pubmed/34795792 http://dx.doi.org/10.1155/2021/7937573 Text en Copyright © 2021 Meijing Li et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Li, Meijing
Chen, Tianjie
Ryu, Keun Ho
Jin, Cheng Hao
An Efficient Parallelized Ontology Network-Based Semantic Similarity Measure for Big Biomedical Document Clustering
title An Efficient Parallelized Ontology Network-Based Semantic Similarity Measure for Big Biomedical Document Clustering
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8594978/
https://www.ncbi.nlm.nih.gov/pubmed/34795792
http://dx.doi.org/10.1155/2021/7937573