Broad-coverage biomedical relation extraction with SemRep
BACKGROUND: In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic rel...
Main Authors: | Kilicoglu, Halil; Rosemblat, Graciela; Fiszman, Marcelo; Shin, Dongwook |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2020 |
Subjects: | Software |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7222583/ https://www.ncbi.nlm.nih.gov/pubmed/32410573 http://dx.doi.org/10.1186/s12859-020-3517-7 |
_version_ | 1783533607282802688 |
---|---|
author | Kilicoglu, Halil Rosemblat, Graciela Fiszman, Marcelo Shin, Dongwook |
author_facet | Kilicoglu, Halil Rosemblat, Graciela Fiszman, Marcelo Shin, Dongwook |
author_sort | Kilicoglu, Halil |
collection | PubMed |
description | BACKGROUND: In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep’s performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships. RESULTS: A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F1 score. The recall and the F1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level. CONCLUSIONS: SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis. |
format | Online Article Text |
id | pubmed-7222583 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7222583 2020-05-27 Broad-coverage biomedical relation extraction with SemRep Kilicoglu, Halil Rosemblat, Graciela Fiszman, Marcelo Shin, Dongwook BMC Bioinformatics Software (abstract as in the description field above) BioMed Central 2020-05-14 /pmc/articles/PMC7222583/ /pubmed/32410573 http://dx.doi.org/10.1186/s12859-020-3517-7 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Software Kilicoglu, Halil Rosemblat, Graciela Fiszman, Marcelo Shin, Dongwook Broad-coverage biomedical relation extraction with SemRep |
title | Broad-coverage biomedical relation extraction with SemRep |
title_full | Broad-coverage biomedical relation extraction with SemRep |
title_fullStr | Broad-coverage biomedical relation extraction with SemRep |
title_full_unstemmed | Broad-coverage biomedical relation extraction with SemRep |
title_short | Broad-coverage biomedical relation extraction with SemRep |
title_sort | broad-coverage biomedical relation extraction with semrep |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7222583/ https://www.ncbi.nlm.nih.gov/pubmed/32410573 http://dx.doi.org/10.1186/s12859-020-3517-7 |
work_keys_str_mv | AT kilicogluhalil broadcoveragebiomedicalrelationextractionwithsemrep AT rosemblatgraciela broadcoveragebiomedicalrelationextractionwithsemrep AT fiszmanmarcelo broadcoveragebiomedicalrelationextractionwithsemrep AT shindongwook broadcoveragebiomedicalrelationextractionwithsemrep |
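The F1 scores in the abstract are the harmonic mean of the reported precision and recall. As a minimal sanity check (the evaluation names below are shorthand labels, not terminology from the paper), the four reported figures can be recomputed from the stated precision/recall pairs:

```python
# Recompute the F1 scores reported in the SemRep abstract from the
# stated precision (P) and recall (R): F1 = 2*P*R / (P + R).

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F1) for each evaluation described in the abstract.
evaluations = {
    "strict, manual dataset":    (0.55, 0.34, 0.42),
    "relaxed, manual dataset":   (0.69, 0.42, 0.52),
    "CDR corpus":                (0.90, 0.24, 0.38),
    "CDR, sentence-bound only":  (0.90, 0.35, 0.50),
}

for name, (p, r, reported) in evaluations.items():
    computed = round(f1(p, r), 2)
    print(f"{name}: F1 = {computed:.2f} (reported {reported:.2f})")
```

All four computed values match the abstract to two decimal places, confirming the reported figures are internally consistent.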