
PEDL: extracting protein–protein associations using deep language models and distant supervision

MOTIVATION: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein–protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notor...


Bibliographic Details
Main authors: Weber, Leon, Thobe, Kirsten, Migueles Lozano, Oscar Arturo, Wolf, Jana, Leser, Ulf
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355289/
https://www.ncbi.nlm.nih.gov/pubmed/32657389
http://dx.doi.org/10.1093/bioinformatics/btaa430
author Weber, Leon
Thobe, Kirsten
Migueles Lozano, Oscar Arturo
Wolf, Jana
Leser, Ulf
collection PubMed
description MOTIVATION: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein–protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance. RESULTS: We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA. AVAILABILITY AND IMPLEMENTATION: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-7355289
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7355289 2020-07-16 PEDL: extracting protein–protein associations using deep language models and distant supervision Weber, Leon Thobe, Kirsten Migueles Lozano, Oscar Arturo Wolf, Jana Leser, Ulf Bioinformatics Systems Biology and Networks
Oxford University Press 2020-07 2020-07-13 /pmc/articles/PMC7355289/ /pubmed/32657389 http://dx.doi.org/10.1093/bioinformatics/btaa430 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
topic Systems Biology and Networks