Broad-coverage biomedical relation extraction with SemRep
BACKGROUND: In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic rel...
Main Authors: | Kilicoglu, Halil; Rosemblat, Graciela; Fiszman, Marcelo; Shin, Dongwook |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2020 |
Subjects: | Software |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7222583/ https://www.ncbi.nlm.nih.gov/pubmed/32410573 http://dx.doi.org/10.1186/s12859-020-3517-7 |
_version_ | 1783533607282802688 |
---|---|
author | Kilicoglu, Halil Rosemblat, Graciela Fiszman, Marcelo Shin, Dongwook |
author_facet | Kilicoglu, Halil Rosemblat, Graciela Fiszman, Marcelo Shin, Dongwook |
author_sort | Kilicoglu, Halil |
collection | PubMed |
description | BACKGROUND: In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep’s performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships. RESULTS: A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F1 score. The recall and the F1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level. CONCLUSIONS: SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis. |
format | Online Article Text |
id | pubmed-7222583 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7222583 2020-05-27 Broad-coverage biomedical relation extraction with SemRep Kilicoglu, Halil Rosemblat, Graciela Fiszman, Marcelo Shin, Dongwook BMC Bioinformatics Software (abstract as in the description field above) BioMed Central 2020-05-14 /pmc/articles/PMC7222583/ /pubmed/32410573 http://dx.doi.org/10.1186/s12859-020-3517-7 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Software Kilicoglu, Halil Rosemblat, Graciela Fiszman, Marcelo Shin, Dongwook Broad-coverage biomedical relation extraction with SemRep |
title | Broad-coverage biomedical relation extraction with SemRep |
title_full | Broad-coverage biomedical relation extraction with SemRep |
title_fullStr | Broad-coverage biomedical relation extraction with SemRep |
title_full_unstemmed | Broad-coverage biomedical relation extraction with SemRep |
title_short | Broad-coverage biomedical relation extraction with SemRep |
title_sort | broad-coverage biomedical relation extraction with semrep |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7222583/ https://www.ncbi.nlm.nih.gov/pubmed/32410573 http://dx.doi.org/10.1186/s12859-020-3517-7 |
work_keys_str_mv | AT kilicogluhalil broadcoveragebiomedicalrelationextractionwithsemrep AT rosemblatgraciela broadcoveragebiomedicalrelationextractionwithsemrep AT fiszmanmarcelo broadcoveragebiomedicalrelationextractionwithsemrep AT shindongwook broadcoveragebiomedicalrelationextractionwithsemrep |
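The F1 scores in the abstract are the harmonic mean of the reported precision and recall. As a minimal sanity check (the evaluation names below are shorthand labels, not terminology from the paper), the four reported figures can be recomputed from the stated precision/recall pairs:

```python
# Recompute the F1 scores reported in the SemRep abstract from the
# stated precision (P) and recall (R): F1 = 2*P*R / (P + R).

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F1) for each evaluation described in the abstract.
evaluations = {
    "strict, manual dataset":    (0.55, 0.34, 0.42),
    "relaxed, manual dataset":   (0.69, 0.42, 0.52),
    "CDR corpus":                (0.90, 0.24, 0.38),
    "CDR, sentence-bound only":  (0.90, 0.35, 0.50),
}

for name, (p, r, reported) in evaluations.items():
    computed = round(f1(p, r), 2)
    print(f"{name}: F1 = {computed:.2f} (reported {reported:.2f})")
```

All four computed values match the abstract to two decimal places, confirming the reported figures are internally consistent.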