Entity linking for biomedical literature

BACKGROUND: The Entity Linking (EL) task links entity mentions from an unstructured document to entities in a knowledge base. Although this problem is well studied in news and social media, it has not received much attention in the life science domain. One outcome of tackling the EL proble...

Bibliographic Details
Main authors: Zheng, Jin G; Howsmon, Daniel; Zhang, Boliang; Hahn, Juergen; McGuinness, Deborah; Hendler, James; Ji, Heng
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4460707/
https://www.ncbi.nlm.nih.gov/pubmed/26045232
http://dx.doi.org/10.1186/1472-6947-15-S1-S4
author Zheng, Jin G
Howsmon, Daniel
Zhang, Boliang
Hahn, Juergen
McGuinness, Deborah
Hendler, James
Ji, Heng
collection PubMed
description BACKGROUND: The Entity Linking (EL) task links entity mentions from an unstructured document to entities in a knowledge base. Although this problem is well studied in news and social media, it has not received much attention in the life science domain. One outcome of tackling the EL problem in the life science domain is to enable scientists to build computational models of biological processes more efficiently. However, simply applying a news-trained entity linker produces inadequate results. METHODS: Since existing supervised approaches require a large amount of manually labeled training data, which is currently unavailable for the life science domain, we propose a novel unsupervised collective inference approach to link entities from unstructured full texts of biomedical literature to 300 ontologies. The approach leverages the rich semantic information and structures in ontologies for similarity computation and entity ranking. RESULTS: Without using any manual annotation, our approach significantly outperforms a state-of-the-art supervised EL method (9% absolute gain in linking accuracy). The supervised method, by contrast, requires 15,000 manually annotated entity mentions for training. These promising results establish a benchmark for the EL task in the life science domain. We also provide an in-depth analysis and discussion of the challenges and opportunities in automatic knowledge enrichment for scientific literature. CONCLUSIONS: In this paper, we propose a novel unsupervised collective inference approach to address the EL problem in a new domain. We show that our unsupervised approach is able to outperform a current state-of-the-art supervised approach that has been trained with a large amount of manually labeled data. Life science is an underrepresented domain for applying EL techniques. 
By providing a small benchmark data set and identifying opportunities, we hope to stimulate discussions across natural language processing and bioinformatics and motivate others to develop techniques for this largely untapped domain.
format Online
Article
Text
id pubmed-4460707
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4460707 2015-06-29 Entity linking for biomedical literature Zheng, Jin G Howsmon, Daniel Zhang, Boliang Hahn, Juergen McGuinness, Deborah Hendler, James Ji, Heng BMC Med Inform Decis Mak Research Article BioMed Central 2015-05-20 /pmc/articles/PMC4460707/ /pubmed/26045232 http://dx.doi.org/10.1186/1472-6947-15-S1-S4 Text en Copyright © 2015 Zheng et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Entity linking for biomedical literature
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4460707/
https://www.ncbi.nlm.nih.gov/pubmed/26045232
http://dx.doi.org/10.1186/1472-6947-15-S1-S4