PEDL: extracting protein–protein associations using deep language models and distant supervision
MOTIVATION: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein–protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notor...
Main Authors: | Weber, Leon; Thobe, Kirsten; Migueles Lozano, Oscar Arturo; Wolf, Jana; Leser, Ulf |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2020 |
Subjects: | Systems Biology and Networks |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355289/ https://www.ncbi.nlm.nih.gov/pubmed/32657389 http://dx.doi.org/10.1093/bioinformatics/btaa430 |
_version_ | 1783558245539905536 |
---|---|
author | Weber, Leon Thobe, Kirsten Migueles Lozano, Oscar Arturo Wolf, Jana Leser, Ulf |
author_sort | Weber, Leon |
collection | PubMed |
description | MOTIVATION: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein–protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance. RESULTS: We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA. AVAILABILITY AND IMPLEMENTATION: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-7355289 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-7355289 2020-07-16 Bioinformatics Systems Biology and Networks Oxford University Press 2020-07 2020-07-13 /pmc/articles/PMC7355289/ /pubmed/32657389 http://dx.doi.org/10.1093/bioinformatics/btaa430 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
title | PEDL: extracting protein–protein associations using deep language models and distant supervision |
topic | Systems Biology and Networks |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355289/ https://www.ncbi.nlm.nih.gov/pubmed/32657389 http://dx.doi.org/10.1093/bioinformatics/btaa430 |