
BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale

A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. Capturing the semantic relatedness of these entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery....


Bibliographic Details
Main Authors:	Chen, Qingyu; Lee, Kyubum; Yan, Shankai; Kim, Sun; Wei, Chih-Hsuan; Lu, Zhiyong
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2020
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7237030/ https://www.ncbi.nlm.nih.gov/pubmed/32324731 http://dx.doi.org/10.1371/journal.pcbi.1007617

author	Chen, Qingyu; Lee, Kyubum; Yan, Shankai; Kim, Sun; Wei, Chih-Hsuan; Lu, Zhiyong
collection	PubMed
description	A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. Capturing the semantic relatedness of these entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings, which are vector representations of concepts learned by machine learning models, have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts in the literature, and different machine learning models are then used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, with four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is among the largest publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks); to the best of our knowledge, this is by far the most comprehensive such evaluation. The intrinsic evaluation results demonstrate that BioConceptVec consistently outperforms existing concept embeddings by a large margin in identifying similar and related concepts. More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.
format	Online Article Text
id	pubmed-7237030
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-7237030 2020-06-03 PLoS Comput Biol Research Article Public Library of Science 2020-04-23 /pmc/articles/PMC7237030/ /pubmed/32324731 http://dx.doi.org/10.1371/journal.pcbi.1007617 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication.
title	BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale
topic	Research Article
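
The description in this record says that concept embeddings are used to identify related concepts. A minimal sketch of that idea follows, ranking concepts by cosine similarity between their vectors. The concept identifiers and 3-dimensional vectors are hypothetical toy values, not taken from the BioConceptVec release, and cosine similarity is assumed here as the relatedness measure for illustration only.

```python
import math

# Toy embeddings: identifiers and values are hypothetical, for illustration only.
EMBEDDINGS = {
    "Gene_A": [0.9, 0.1, 0.0],
    "Gene_B": [0.8, 0.2, 0.1],
    "Chemical_X": [0.0, 0.9, 0.4],
    "Disease_Y": [0.1, 0.3, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_related(concept, embeddings, top_k=2):
    """Return the top_k other concepts, sorted by descending cosine similarity."""
    query = embeddings[concept]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != concept
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for name, score in most_related("Gene_A", EMBEDDINGS):
        print(f"{name}\t{score:.3f}")
```

With real embeddings the dictionary would be replaced by vectors loaded from the published files; the loading step is omitted because the file format is not described in this record.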
