Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts

BACKGROUND: Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an...

Bibliographic Details
Main Authors:	Nicholson, David N., Himmelstein, Daniel S., Greene, Casey S.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9578183/ https://www.ncbi.nlm.nih.gov/pubmed/36258252 http://dx.doi.org/10.1186/s13040-022-00311-z

author	Nicholson, David N.; Himmelstein, Daniel S.; Greene, Casey S.
collection	PubMed
description	BACKGROUND: Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an exponentially rising publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to annotate textual data automatically. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This bottleneck makes populating a knowledge graph with multiple nodes and edge types practically infeasible. Thus, we sought to accelerate the label function creation process by evaluating how label functions can be re-used across multiple edge types. RESULTS: We obtained entity-tagged abstracts and subsetted these entities to only contain compounds, genes, and disease mentions. We extracted sentences containing co-mentions of certain biomedical entities contained in a previously described knowledge graph, Hetionet v1. We trained a baseline model that used database-only label functions and then used a sampling approach to measure how well adding edge-specific or edge-mismatch label function combinations improved over our baseline. Next, we trained a discriminator model to detect sentences that indicated a biomedical relationship and then estimated the number of edge types that could be recalled and added to Hetionet v1. We found that adding edge-mismatch label functions rarely improved relationship extraction, while control edge-specific label functions did. There were two exceptions to this trend, Compound-binds-Gene and Gene-interacts-Gene, which both indicated physical relationships and showed signs of transferability. 
Across the scenarios tested, discriminative model performance strongly depends on generated annotations. Using the best discriminative model for each edge type, we recalled close to 30% of established edges within Hetionet v1. CONCLUSIONS: Our results show that this framework can incorporate novel edges into our source knowledge graph. However, results with label function transfer were mixed. Only label functions describing very similar edge types supported improved performance when transferred. We expect that the continued development of this strategy may provide essential building blocks to populating biomedical knowledge graphs with discoveries, ensuring that these resources include cutting-edge results. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13040-022-00311-z.
format	Online Article Text
id	pubmed-9578183
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-9578183 2022-10-19 Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts Nicholson, David N. Himmelstein, Daniel S. Greene, Casey S. BioData Min Research BioMed Central 2022-10-18 /pmc/articles/PMC9578183/ /pubmed/36258252 http://dx.doi.org/10.1186/s13040-022-00311-z Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9578183/ https://www.ncbi.nlm.nih.gov/pubmed/36258252 http://dx.doi.org/10.1186/s13040-022-00311-z
