
Constructing co-occurrence network embeddings to assist association extraction for COVID-19 and other coronavirus infectious diseases

OBJECTIVE: As coronavirus disease 2019 (COVID-19) started its rapid emergence and gradually transformed into an unprecedented pandemic, the need for having a knowledge repository for the disease became crucial. To address this issue, a new COVID-19 machine-readable dataset known as the COVID-19 Open...


Bibliographic Details
Main Authors: Oniani, David, Jiang, Guoqian, Liu, Hongfang, Shen, Feichen
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7314034/
https://www.ncbi.nlm.nih.gov/pubmed/32458963
http://dx.doi.org/10.1093/jamia/ocaa117
author Oniani, David
Jiang, Guoqian
Liu, Hongfang
Shen, Feichen
author_sort Oniani, David
collection PubMed
description OBJECTIVE: As coronavirus disease 2019 (COVID-19) started its rapid emergence and gradually transformed into an unprecedented pandemic, the need for a knowledge repository for the disease became crucial. To address this issue, a new COVID-19 machine-readable dataset known as the COVID-19 Open Research Dataset (CORD-19) has been released. Based on this, our objective was to build computable co-occurrence network embeddings to assist association detection among COVID-19–related biomedical entities. MATERIALS AND METHODS: Leveraging a Linked Data version of CORD-19 (ie, CORD-19-on-FHIR), we first utilized SPARQL to extract co-occurrences among chemicals, diseases, genes, and mutations and build a co-occurrence network. We then trained the representation of the derived co-occurrence network using node2vec with 4 edge embedding operations (L1, L2, Average, and Hadamard). Six algorithms (decision tree, logistic regression, support vector machine, random forest, naïve Bayes, and multilayer perceptron) were applied to evaluate performance on link prediction. An unsupervised learning strategy was also developed incorporating the t-SNE (t-distributed stochastic neighbor embedding) and DBSCAN (density-based spatial clustering of applications with noise) algorithms for case studies. RESULTS: The random forest classifier showed the best performance on link prediction across different network embeddings. For edge embeddings generated using the Average operation, random forest achieved the optimal average precision of 0.97 along with an F1 score of 0.90. For unsupervised learning, 63 clusters were formed with a silhouette score of 0.128. Significant associations were detected for 5 coronavirus infectious diseases in their corresponding subgroups. CONCLUSIONS: In this study, we constructed COVID-19–centered co-occurrence network embeddings. Results indicated that the generated embeddings were able to extract significant associations for COVID-19 and coronavirus infectious diseases.
format Online
Article
Text
id pubmed-7314034
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7314034 2020-06-25 Constructing co-occurrence network embeddings to assist association extraction for COVID-19 and other coronavirus infectious diseases Oniani, David Jiang, Guoqian Liu, Hongfang Shen, Feichen J Am Med Inform Assoc Research and Applications Oxford University Press 2020-05-27 /pmc/articles/PMC7314034/ /pubmed/32458963 http://dx.doi.org/10.1093/jamia/ocaa117 Text en © The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
title Constructing co-occurrence network embeddings to assist association extraction for COVID-19 and other coronavirus infectious diseases
topic Research and Applications
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7314034/
https://www.ncbi.nlm.nih.gov/pubmed/32458963
http://dx.doi.org/10.1093/jamia/ocaa117
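
The description field of this record names 4 edge embedding operations (L1, L2, Average, and Hadamard) applied to node2vec node vectors. The following is a minimal sketch of how such operators are commonly defined, following the binary operators in the original node2vec paper; whether the article used exactly these definitions is an assumption, and the names `EDGE_OPERATORS` and `edge_embedding` are hypothetical, introduced only for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Element-wise binary operators that combine two node-vector components
# into one edge-vector component. Definitions follow the node2vec paper;
# that the catalogued article used exactly these is an assumption.
EDGE_OPERATORS: Dict[str, Callable[[float, float], float]] = {
    "average": lambda a, b: (a + b) / 2.0,   # mean of the two components
    "hadamard": lambda a, b: a * b,          # element-wise product
    "l1": lambda a, b: abs(a - b),           # absolute difference
    "l2": lambda a, b: (a - b) ** 2,         # squared difference
}


def edge_embedding(u: Sequence[float], v: Sequence[float],
                   op: str = "average") -> List[float]:
    """Combine the embeddings of two nodes into one edge embedding."""
    if len(u) != len(v):
        raise ValueError("node vectors must have the same dimension")
    combine = EDGE_OPERATORS[op]  # raises KeyError for an unknown operator
    return [combine(a, b) for a, b in zip(u, v)]


if __name__ == "__main__":
    # Toy 3-dimensional node vectors, not taken from the article.
    u = [1.0, -2.0, 0.5]
    v = [3.0, 2.0, 0.5]
    for name in EDGE_OPERATORS:
        print(name, edge_embedding(u, v, name))
```

In a link prediction setup like the one the description outlines, each candidate node pair would be turned into one such edge vector and passed to a classifier; all four operators are symmetric in the two nodes, which suits an undirected co-occurrence network.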