A method of inferring the relationship between Biomedical entities through correlation analysis on text

BACKGROUND: One of the most important steps in machine learning-based natural language processing is word representation. The commonly used one-hot representation produces very large vectors and assumes that the features making up the vector are independent of each other. On the...

Bibliographic Details
Main authors: Song, Hye-Jeong; Yoon, Byeong-Hun; Youn, Young-Shin; Park, Chan-Young; Kim, Jong-Dae; Kim, Yu-Seop
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects: Research
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6218997/
https://www.ncbi.nlm.nih.gov/pubmed/30396345
http://dx.doi.org/10.1186/s12938-018-0583-4
_version_ 1783368561365876736
author Song, Hye-Jeong
Yoon, Byeong-Hun
Youn, Young-Shin
Park, Chan-Young
Kim, Jong-Dae
Kim, Yu-Seop
collection PubMed
description BACKGROUND: One of the most important steps in machine learning-based natural language processing is word representation. The commonly used one-hot representation produces very large vectors and assumes that the features making up the vector are independent of each other. Word embedding, on the other hand, is known to be highly effective for estimating the similarity between words because it captures word meaning well. In this study, we try to clarify the correlation between various terms in biomedical texts, based on the strong similarity-estimation ability of word embedding. To that end, we used word embedding to find new biomarkers and microorganisms related to specific diseases. METHODS: We analyze the correlation between disease-marker and disease-microorganism pairs. First, we construct a corpus that is likely to be related to them by extracting titles and abstracts from biomedical texts on the PubMed site. Second, we express disease, marker, and microorganism terms as word embeddings using Canonical Correlation Analysis (CCA). CCA is a statistics-based method that performs very well at vector dimension reduction. Finally, we estimate the relationship within disease-marker pairs and disease-microorganism pairs by measuring their similarity. RESULTS: In the experiment, we tried to confirm the correlations derived through word embedding using Google Scholar search results. Of the 20 most highly correlated disease-marker pairs, about 85% have been the subject of substantial research according to Google Scholar search. Conversely, for 85% of the 20 pairs with the lowest correlation, we could not find any study that determines the relationship between the disease and the marker. This trend was similar for disease-microbe pairs.
CONCLUSIONS: The correlations between diseases and markers, and between diseases and microorganisms, calculated through word embedding reflect actual research trends. If the word-embedding correlation for a pair is high but few studies have been published, additional research can be proposed for that pair.
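The final step described in the abstract, estimating how strongly a disease and a marker are related by measuring the similarity of their embedded terms, can be illustrated with a minimal Python sketch. It assumes cosine similarity as the measure (the abstract does not name one), and every vector and name in it (disease_a, marker_x, and so on) is a hypothetical toy value, not data from the paper. The CCA embedding step itself is not implemented here; the sketch only contrasts one-hot vectors, which never show similarity between distinct words, with dense vectors that can be ranked.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def one_hot(index, size):
    """One-hot vector: a single 1.0 at `index`, zeros elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def rank_pairs(disease_vecs, marker_vecs):
    """Score every disease-marker pair and sort from most to least similar."""
    scored = [
        (cosine(d_vec, m_vec), disease, marker)
        for disease, d_vec in disease_vecs.items()
        for marker, m_vec in marker_vecs.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical 3-dimensional embeddings, invented purely for illustration.
DISEASES = {"disease_a": [0.9, 0.1, 0.0], "disease_b": [0.0, 0.2, 0.9]}
MARKERS = {"marker_x": [0.8, 0.2, 0.1], "marker_y": [0.1, 0.1, 0.95]}

if __name__ == "__main__":
    # Distinct one-hot words are orthogonal, so their similarity carries no information.
    print("one-hot similarity:", cosine(one_hot(0, 4), one_hot(1, 4)))
    # Dense vectors give graded scores that can be ranked.
    for score, disease, marker in rank_pairs(DISEASES, MARKERS):
        print(f"{disease} / {marker}: {score:.3f}")
```

In this reading, pairs at the top of the ranking correspond to the "highly correlated" pairs that the study checked against Google Scholar, and pairs at the bottom to the low-correlation ones.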
format Online
Article
Text
id pubmed-6218997
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6218997 2018-11-08 A method of inferring the relationship between Biomedical entities through correlation analysis on text Song, Hye-Jeong Yoon, Byeong-Hun Youn, Young-Shin Park, Chan-Young Kim, Jong-Dae Kim, Yu-Seop Biomed Eng Online Research BioMed Central 2018-11-06 /pmc/articles/PMC6218997/ /pubmed/30396345 http://dx.doi.org/10.1186/s12938-018-0583-4 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A method of inferring the relationship between Biomedical entities through correlation analysis on text
topic Research