BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale

A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. Capturing the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery....

Bibliographic Details
Main Authors: Chen, Qingyu; Lee, Kyubum; Yan, Shankai; Kim, Sun; Wei, Chih-Hsuan; Lu, Zhiyong
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7237030/
https://www.ncbi.nlm.nih.gov/pubmed/32324731
http://dx.doi.org/10.1371/journal.pcbi.1007617
author Chen, Qingyu
Lee, Kyubum
Yan, Shankai
Kim, Sun
Wei, Chih-Hsuan
Lu, Zhiyong
collection PubMed
description A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. Capturing the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings, which are vector representations of concepts learned using machine learning models, have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is one of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, to the best of our knowledge, the most comprehensive evaluation to date.
The intrinsic evaluation results demonstrate that BioConceptVec consistently outperforms existing concept embeddings, by a large margin, in identifying similar and related concepts. More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.
format Online
Article
Text
id pubmed-7237030
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7237030 2020-06-03 BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale PLoS Comput Biol Research Article Public Library of Science 2020-04-23 /pmc/articles/PMC7237030/ /pubmed/32324731 http://dx.doi.org/10.1371/journal.pcbi.1007617 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication.
title BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7237030/
https://www.ncbi.nlm.nih.gov/pubmed/32324731
http://dx.doi.org/10.1371/journal.pcbi.1007617