
Broad-coverage biomedical relation extraction with SemRep

BACKGROUND: In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep’s performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships. RESULTS: A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F1 score. The recall and the F1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level. CONCLUSIONS: SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis.


Bibliographic Details
Main Authors: Kilicoglu, Halil, Rosemblat, Graciela, Fiszman, Marcelo, Shin, Dongwook
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7222583/
https://www.ncbi.nlm.nih.gov/pubmed/32410573
http://dx.doi.org/10.1186/s12859-020-3517-7
_version_ 1783533607282802688
author Kilicoglu, Halil
Rosemblat, Graciela
Fiszman, Marcelo
Shin, Dongwook
author_facet Kilicoglu, Halil
Rosemblat, Graciela
Fiszman, Marcelo
Shin, Dongwook
author_sort Kilicoglu, Halil
collection PubMed
description BACKGROUND: In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep’s performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships. RESULTS: A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F1 score. The recall and the F1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level. CONCLUSIONS: SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations.
Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis.
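The F1 scores in the abstract are the standard harmonic mean of the reported precision and recall pairs. As a minimal sanity-check sketch (not part of SemRep; for the sentence-bound CDR figure, precision is assumed to stay at 0.90, which the abstract does not state explicitly):

```python
# Recompute the F1 scores reported in the abstract from their
# (precision, recall) pairs. f1() is the standard harmonic mean,
# not a SemRep-specific function.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

evaluations = {
    "strict (annotated set)":  (0.55, 0.34),  # reported F1: 0.42
    "relaxed (annotated set)": (0.69, 0.42),  # reported F1: 0.52
    "CDR corpus":              (0.90, 0.24),  # reported F1: 0.38
    # Assumption: precision unchanged at 0.90 for the sentence-bound case.
    "CDR, sentence-bound":     (0.90, 0.35),  # reported F1: 0.50
}

for name, (p, r) in evaluations.items():
    print(f"{name}: F1 = {f1(p, r):.2f}")
```

Each recomputed value matches the rounded figure in the abstract, which also supports the assumption that precision is unchanged in the sentence-bound CDR evaluation.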
format Online
Article
Text
id pubmed-7222583
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-72225832020-05-27 Broad-coverage biomedical relation extraction with SemRep Kilicoglu, Halil Rosemblat, Graciela Fiszman, Marcelo Shin, Dongwook BMC Bioinformatics Software BioMed Central 2020-05-14 /pmc/articles/PMC7222583/ /pubmed/32410573 http://dx.doi.org/10.1186/s12859-020-3517-7 Text en © The Author(s) 2020. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/); the Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Software
Kilicoglu, Halil
Rosemblat, Graciela
Fiszman, Marcelo
Shin, Dongwook
Broad-coverage biomedical relation extraction with SemRep
title Broad-coverage biomedical relation extraction with SemRep
title_full Broad-coverage biomedical relation extraction with SemRep
title_fullStr Broad-coverage biomedical relation extraction with SemRep
title_full_unstemmed Broad-coverage biomedical relation extraction with SemRep
title_short Broad-coverage biomedical relation extraction with SemRep
title_sort broad-coverage biomedical relation extraction with semrep
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7222583/
https://www.ncbi.nlm.nih.gov/pubmed/32410573
http://dx.doi.org/10.1186/s12859-020-3517-7
work_keys_str_mv AT kilicogluhalil broadcoveragebiomedicalrelationextractionwithsemrep
AT rosemblatgraciela broadcoveragebiomedicalrelationextractionwithsemrep
AT fiszmanmarcelo broadcoveragebiomedicalrelationextractionwithsemrep
AT shindongwook broadcoveragebiomedicalrelationextractionwithsemrep