
STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs

MOTIVATION: The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models. However, representations based on a single modality are inherently limited.


Bibliographic Details
Main Authors: Balabin, Helena, Hoyt, Charles Tapley, Birkenbihl, Colin, Gyori, Benjamin M, Bachman, John, Kodamullil, Alpha Tom, Plöger, Paul G, Hofmann-Apitius, Martin, Domingo-Fernández, Daniel
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8896635/
https://www.ncbi.nlm.nih.gov/pubmed/34986221
http://dx.doi.org/10.1093/bioinformatics/btac001
_version_ 1784663204511612928
author Balabin, Helena
Hoyt, Charles Tapley
Birkenbihl, Colin
Gyori, Benjamin M
Bachman, John
Kodamullil, Alpha Tom
Plöger, Paul G
Hofmann-Apitius, Martin
Domingo-Fernández, Daniel
author_facet Balabin, Helena
Hoyt, Charles Tapley
Birkenbihl, Colin
Gyori, Benjamin M
Bachman, John
Kodamullil, Alpha Tom
Plöger, Paul G
Hofmann-Apitius, Martin
Domingo-Fernández, Daniel
author_sort Balabin, Helena
collection PubMed
description MOTIVATION: The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models. However, representations based on a single modality are inherently limited. RESULTS: To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs (KGs). This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations in a shared embedding space. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against three baseline models trained on either one of the modalities (i.e. text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.084 (i.e. from 0.881 to 0.965). Finally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications. AVAILABILITY AND IMPLEMENTATION: We make the source code and the Python package of STonKGs available at GitHub (https://github.com/stonkgs/stonkgs) and PyPI (https://pypi.org/project/stonkgs/). The pre-trained STonKGs models and the task-specific classification models are respectively available at https://huggingface.co/stonkgs/stonkgs-150k and https://zenodo.org/communities/stonkgs. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
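The availability statement above names a pip-installable Python package (stonkgs) and a pre-trained checkpoint hosted on Hugging Face. The following is a minimal sketch of how that checkpoint might be loaded, assuming the stonkgs/stonkgs-150k repository (the model ID from the availability statement) is compatible with the generic Hugging Face transformers auto classes; the stonkgs package itself may expose dedicated loaders and preprocessing for the combined text-plus-triple input format, which are not reproduced here.

    # Hedged sketch, not the authors' documented API: load the pre-trained
    # STonKGs checkpoint through the generic Hugging Face transformers
    # auto classes. Assumption: stonkgs/stonkgs-150k resolves via
    # AutoTokenizer/AutoModel. Requires: pip install transformers torch
    from transformers import AutoModel, AutoTokenizer

    model_id = "stonkgs/stonkgs-150k"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)

    # STonKGs consumes combined sequences of text tokens and KG-triple
    # tokens; the triple-encoding step is package-specific, so only the
    # text half is encoded here as a placeholder input.
    inputs = tokenizer("BRCA1 is associated with breast cancer.", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # contextual embeddings from the shared space

For the task-specific classification models and the full text-plus-triple pipeline, the stonkgs package on PyPI and the Zenodo community linked above are the authoritative sources; the sketch only demonstrates checkpoint loading.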
format Online
Article
Text
id pubmed-8896635
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8896635 2022-03-07 STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs Balabin, Helena Hoyt, Charles Tapley Birkenbihl, Colin Gyori, Benjamin M Bachman, John Kodamullil, Alpha Tom Plöger, Paul G Hofmann-Apitius, Martin Domingo-Fernández, Daniel Bioinformatics Original Papers MOTIVATION: The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models. However, representations based on a single modality are inherently limited. RESULTS: To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs (KGs). This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations in a shared embedding space. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against three baseline models trained on either one of the modalities (i.e. text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.084 (i.e. from 0.881 to 0.965). Finally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications. AVAILABILITY AND IMPLEMENTATION: We make the source code and the Python package of STonKGs available at GitHub (https://github.com/stonkgs/stonkgs) and PyPI (https://pypi.org/project/stonkgs/). The pre-trained STonKGs models and the task-specific classification models are respectively available at https://huggingface.co/stonkgs/stonkgs-150k and https://zenodo.org/communities/stonkgs. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2022-01-05 /pmc/articles/PMC8896635/ /pubmed/34986221 http://dx.doi.org/10.1093/bioinformatics/btac001 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Balabin, Helena
Hoyt, Charles Tapley
Birkenbihl, Colin
Gyori, Benjamin M
Bachman, John
Kodamullil, Alpha Tom
Plöger, Paul G
Hofmann-Apitius, Martin
Domingo-Fernández, Daniel
STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs
title STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs
title_full STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs
title_fullStr STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs
title_full_unstemmed STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs
title_short STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs
title_sort stonkgs: a sophisticated transformer trained on biomedical text and knowledge graphs
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8896635/
https://www.ncbi.nlm.nih.gov/pubmed/34986221
http://dx.doi.org/10.1093/bioinformatics/btac001
work_keys_str_mv AT balabinhelena stonkgsasophisticatedtransformertrainedonbiomedicaltextandknowledgegraphs
AT hoytcharlestapley stonkgsasophisticatedtransformertrainedonbiomedicaltextandknowledgegraphs
AT birkenbihlcolin stonkgsasophisticatedtransformertrainedonbiomedicaltextandknowledgegraphs
AT gyoribenjaminm stonkgsasophisticatedtransformertrainedonbiomedicaltextandknowledgegraphs
AT bachmanjohn stonkgsasophisticatedtransformertrainedonbiomedicaltextandknowledgegraphs
AT kodamullilalphatom stonkgsasophisticatedtransformertrainedonbiomedicaltextandknowledgegraphs
AT plogerpaulg stonkgsasophisticatedtransformertrainedonbiomedicaltextandknowledgegraphs
AT hofmannapitiusmartin stonkgsasophisticatedtransformertrainedonbiomedicaltextandknowledgegraphs
AT domingofernandezdaniel stonkgsasophisticatedtransformertrainedonbiomedicaltextandknowledgegraphs