Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical–protein relations

It is getting increasingly challenging to efficiently exploit drug-related information described in the growing amount of scientific literature. Indeed, for drug–gene/protein interactions, the challenge is even bigger, considering the scattered information sources and types of interactions. However,...

Bibliographic Details
Main authors:	Miranda-Escalada, Antonio; Mehryary, Farrokh; Luoma, Jouni; Estrada-Zavala, Darryl; Gasco, Luis; Pyysalo, Sampo; Valencia, Alfonso; Krallinger, Martin
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Original Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10683943/ https://www.ncbi.nlm.nih.gov/pubmed/38015956 http://dx.doi.org/10.1093/database/baad080

author	Miranda-Escalada, Antonio; Mehryary, Farrokh; Luoma, Jouni; Estrada-Zavala, Darryl; Gasco, Luis; Pyysalo, Sampo; Valencia, Alfonso; Krallinger, Martin
author_sort	Miranda-Escalada, Antonio
collection	PubMed
description	It is getting increasingly challenging to efficiently exploit drug-related information described in the growing amount of scientific literature. Indeed, for drug–gene/protein interactions, the challenge is even bigger, considering the scattered information sources and types of interactions. However, their systematic, large-scale exploitation is key for developing tools, impacting knowledge fields as diverse as drug design or metabolic pathway research. Previous efforts in the extraction of drug–gene/protein interactions from the literature did not address these scalability and granularity issues. To tackle them, we have organized the DrugProt track at BioCreative VII. In the context of the track, we have released the DrugProt Gold Standard corpus, a collection of 5000 PubMed abstracts, manually annotated with granular drug–gene/protein interactions. We have proposed a novel large-scale track to evaluate the capacity of natural language processing systems to scale to the range of millions of documents, and generate with their predictions a silver standard knowledge graph of 53 993 602 nodes and 19 367 406 edges. Its use exceeds the shared task and points toward pharmacological and biological applications such as drug discovery or continuous database curation. Finally, we have created a persistent evaluation scenario on CodaLab to continuously evaluate new relation extraction systems that may arise. Thirty teams from four continents, which involved 110 people, sent 107 submission runs for the Main DrugProt track, and nine teams submitted 21 runs for the Large Scale DrugProt track. Most participants implemented deep learning approaches based on pretrained transformer-like language models (LMs) such as BERT or BioBERT, reaching precision and recall values as high as 0.9167 and 0.9542 for some relation types. Finally, some initial explorations of the applicability of the knowledge graph have shown its potential to explore the chemical–protein relations described in the literature, or chemical compound–enzyme interactions. Database URL: https://doi.org/10.5281/zenodo.4955410
format	Online Article Text
id	pubmed-10683943
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10683943 2023-11-30 Database (Oxford) Original Article Oxford University Press 2023-11-28 /pmc/articles/PMC10683943/ /pubmed/38015956 http://dx.doi.org/10.1093/database/baad080 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical–protein relations
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10683943/ https://www.ncbi.nlm.nih.gov/pubmed/38015956 http://dx.doi.org/10.1093/database/baad080