Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts
BACKGROUND: Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an...
Main Authors: | Nicholson, David N., Himmelstein, Daniel S., Greene, Casey S. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | Research |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9578183/ https://www.ncbi.nlm.nih.gov/pubmed/36258252 http://dx.doi.org/10.1186/s13040-022-00311-z |
_version_ | 1784811917223657472 |
---|---|
author | Nicholson, David N. Himmelstein, Daniel S. Greene, Casey S. |
author_facet | Nicholson, David N. Himmelstein, Daniel S. Greene, Casey S. |
author_sort | Nicholson, David N. |
collection | PubMed |
description | BACKGROUND: Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an exponentially rising publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to annotate textual data automatically. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This bottleneck makes populating a knowledge graph with multiple nodes and edge types practically infeasible. Thus, we sought to accelerate the label function creation process by evaluating how label functions can be re-used across multiple edge types. RESULTS: We obtained entity-tagged abstracts and subsetted these entities to only contain compounds, genes, and disease mentions. We extracted sentences containing co-mentions of certain biomedical entities contained in a previously described knowledge graph, Hetionet v1. We trained a baseline model that used database-only label functions and then used a sampling approach to measure how well adding edge-specific or edge-mismatch label function combinations improved over our baseline. Next, we trained a discriminator model to detect sentences that indicated a biomedical relationship and then estimated the number of edge types that could be recalled and added to Hetionet v1. We found that adding edge-mismatch label functions rarely improved relationship extraction, while control edge-specific label functions did. There were two exceptions to this trend, Compound-binds-Gene and Gene-interacts-Gene, which both indicated physical relationships and showed signs of transferability. 
Across the scenarios tested, discriminative model performance strongly depends on generated annotations. Using the best discriminative model for each edge type, we recalled close to 30% of established edges within Hetionet v1. CONCLUSIONS: Our results show that this framework can incorporate novel edges into our source knowledge graph. However, results with label function transfer were mixed. Only label functions describing very similar edge types supported improved performance when transferred. We expect that the continued development of this strategy may provide essential building blocks to populating biomedical knowledge graphs with discoveries, ensuring that these resources include cutting-edge results. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13040-022-00311-z. |
format | Online Article Text |
id | pubmed-9578183 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-9578183 2022-10-19 Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts Nicholson, David N. Himmelstein, Daniel S. Greene, Casey S. BioData Min Research BioMed Central 2022-10-18 /pmc/articles/PMC9578183/ /pubmed/36258252 http://dx.doi.org/10.1186/s13040-022-00311-z Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Nicholson, David N. Himmelstein, Daniel S. Greene, Casey S. Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts |
title | Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9578183/ https://www.ncbi.nlm.nih.gov/pubmed/36258252 http://dx.doi.org/10.1186/s13040-022-00311-z |
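
The description above refers to label functions: small programs that annotate sentences automatically, some drawing only on a database and others on text heuristics. The following is a minimal sketch of that idea, not the authors' implementation. All names (`KNOWN_BINDS`, `lf_in_database`, `lf_binds_keyword`, `lf_negation`, `majority_vote`, `annotate`), the keyword lists, and the example pair are hypothetical, and a plain majority vote stands in for whatever label-combination model the article actually used.

```python
from collections import Counter

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

# Hypothetical stand-in for edges already present in a knowledge graph.
KNOWN_BINDS = {("imatinib", "ABL1")}


def lf_in_database(sentence, compound, gene):
    """Database-only label function: vote positive if the pair is a known edge."""
    return POSITIVE if (compound, gene) in KNOWN_BINDS else ABSTAIN


def lf_binds_keyword(sentence, compound, gene):
    """Text heuristic (assumed keywords): vote positive on binding vocabulary."""
    keywords = ("binds", "inhibits", "inhibitor of")
    text = sentence.lower()
    return POSITIVE if any(k in text for k in keywords) else ABSTAIN


def lf_negation(sentence, compound, gene):
    """Text heuristic (assumed keywords): vote negative on explicit negation."""
    keywords = ("does not bind", "no interaction")
    text = sentence.lower()
    return NEGATIVE if any(k in text for k in keywords) else ABSTAIN


LABEL_FUNCTIONS = (lf_in_database, lf_binds_keyword, lf_negation)


def majority_vote(votes):
    """Combine votes, ignoring abstentions; abstain when empty or tied."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    ranked = counted.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def annotate(sentence, compound, gene, label_functions=LABEL_FUNCTIONS):
    """Apply every label function to one candidate sentence and combine the votes."""
    return majority_vote([lf(sentence, compound, gene) for lf in label_functions])
```

Under this sketch, reusing label functions across edge types would amount to passing a different `label_functions` tuple to `annotate`, for example keyword heuristics written for one edge type applied to candidate sentences of another.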