Drug knowledge discovery via multi-task learning and pre-trained models
BACKGROUND: Drug repurposing seeks new indications for approved drugs and is essential for investigating new uses of approved or investigational drugs. The Active Gene Annotation Corpus (AGAC), annotated by human experts, was developed to support knowledge discovery for drug repurposing. …
Main authors: | Li, Dongfang; Xiong, Ying; Hu, Baotian; Tang, Buzhou; Peng, Weihua; Chen, Qingcai |
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2021 |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8596901/ https://www.ncbi.nlm.nih.gov/pubmed/34789238 http://dx.doi.org/10.1186/s12911-021-01614-7 |
_version_ | 1784600493398425600 |
author | Li, Dongfang; Xiong, Ying; Hu, Baotian; Tang, Buzhou; Peng, Weihua; Chen, Qingcai |
author_facet | Li, Dongfang; Xiong, Ying; Hu, Baotian; Tang, Buzhou; Peng, Weihua; Chen, Qingcai |
author_sort | Li, Dongfang |
collection | PubMed |
description | BACKGROUND: Drug repurposing seeks new indications for approved drugs and is essential for investigating new uses of approved or investigational drugs. The Active Gene Annotation Corpus (AGAC), annotated by human experts, was developed to support knowledge discovery for drug repurposing. The AGAC track of the BioNLP Open Shared Tasks, which uses this corpus, was organized at EMNLP-BioNLP 2019; its “selective annotation” attribute makes the AGAC track more challenging than traditional sequence labeling tasks. In this work, we present our methods for trigger word detection (Task 1) and thematic role identification (Task 2) in the AGAC track. As a step toward drug repurposing research, our work can also be applied to large-scale automatic extraction of knowledge from medical text. METHODS: To meet the challenges of the two tasks, we cast Task 1 as medical named entity recognition (NER), which captures molecular phenomena related to gene mutation, and Task 2 as relation extraction, which captures the thematic roles between entities. We exploit pre-trained biomedical language representation models (e.g., BioBERT) in an information extraction pipeline for collecting mutation-disease knowledge from PubMed. Moreover, we design a fine-tuning framework that uses multi-task learning and extra features. We further investigate different approaches to consolidating and transferring knowledge from varying sources and report the performance of our model on the AGAC corpus. Our approach fine-tunes BERT, BioBERT, NCBI BERT, and ClinicalBERT with multi-task learning. Further experiments show the effectiveness of knowledge transfer and of ensembling the models of the two tasks. We compare the performance of various algorithms and conduct an ablation study on the development set of Task 1 to examine the contribution of each component of our method. RESULTS: Compared with competing methods, our model obtained the highest precision (0.63), recall (0.56), and F-score (0.60) in Task 1, ranking first; it outperformed the baseline provided by the organizers by 0.10 in F-score. The model shares its encoding layers between the named entity recognition and relation extraction parts. We obtained the second-highest F-score (0.25) in Task 2 with a simple but effective framework. CONCLUSIONS: Experimental results on the AGAC benchmark (annotation of genes with active mutation-centric function changes) show that integrating pre-trained biomedical language representation models (i.e., BERT, NCBI BERT, ClinicalBERT, BioBERT) into an information extraction pipeline with multi-task learning can improve the ability to collect mutation-disease knowledge from PubMed. |
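The METHODS and RESULTS text above describes the core architecture: a pre-trained biomedical encoder (e.g., BioBERT) whose encoding layers are shared between a trigger-word NER head (Task 1) and a thematic-role relation extraction head (Task 2), fine-tuned with multi-task learning. The record does not include the authors' implementation, so the sketch below only illustrates that shared-encoder multi-task pattern with the Hugging Face transformers library; the checkpoint name, label counts, dropout rate, and single-linear-layer heads are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a shared-encoder multi-task model, assuming a public
# BioBERT checkpoint and illustrative label counts (both are assumptions,
# not values from the paper).
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SharedEncoderMultiTask(nn.Module):
    def __init__(self, encoder_name="dmis-lab/biobert-base-cased-v1.1",
                 num_ner_labels=25, num_rel_labels=10):
        super().__init__()
        # Encoding layers shared by both tasks, as the abstract states.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(0.1)
        # Task 1: per-token logits for BIO-style trigger-word tagging.
        self.ner_head = nn.Linear(hidden, num_ner_labels)
        # Task 2: one relation label per encoded sequence (e.g., entity pair).
        self.rel_head = nn.Linear(hidden, num_rel_labels)

    def forward(self, input_ids, attention_mask, task):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        if task == "ner":
            # Token-level logits: (batch, seq_len, num_ner_labels)
            return self.ner_head(self.dropout(out.last_hidden_state))
        # Relation logits from the [CLS] position: (batch, num_rel_labels)
        return self.rel_head(self.dropout(out.last_hidden_state[:, 0]))

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = SharedEncoderMultiTask()
batch = tokenizer(["The p53 mutation causes loss of function."],
                  return_tensors="pt", padding=True)
ner_logits = model(batch["input_ids"], batch["attention_mask"], task="ner")
rel_logits = model(batch["input_ids"], batch["attention_mask"], task="rel")
```

In a typical multi-task fine-tuning recipe, batches from the two tasks are interleaved so that gradients from both objectives update the shared encoder; this is one plausible reading of how the shared encoding layers described in the RESULTS section would be trained.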
format | Online Article Text |
id | pubmed-8596901 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-8596901 2021-11-17 Drug knowledge discovery via multi-task learning and pre-trained models Li, Dongfang; Xiong, Ying; Hu, Baotian; Tang, Buzhou; Peng, Weihua; Chen, Qingcai BMC Med Inform Decis Mak Research (abstract as in the description field above)
BioMed Central 2021-11-16 /pmc/articles/PMC8596901/ /pubmed/34789238 http://dx.doi.org/10.1186/s12911-021-01614-7 Text (en) © The Author(s) 2021. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research; Li, Dongfang; Xiong, Ying; Hu, Baotian; Tang, Buzhou; Peng, Weihua; Chen, Qingcai; Drug knowledge discovery via multi-task learning and pre-trained models |
title | Drug knowledge discovery via multi-task learning and pre-trained models |
title_full | Drug knowledge discovery via multi-task learning and pre-trained models |
title_fullStr | Drug knowledge discovery via multi-task learning and pre-trained models |
title_full_unstemmed | Drug knowledge discovery via multi-task learning and pre-trained models |
title_short | Drug knowledge discovery via multi-task learning and pre-trained models |
title_sort | drug knowledge discovery via multi-task learning and pre-trained models |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8596901/ https://www.ncbi.nlm.nih.gov/pubmed/34789238 http://dx.doi.org/10.1186/s12911-021-01614-7 |
work_keys_str_mv | AT lidongfang drugknowledgediscoveryviamultitasklearningandpretrainedmodels AT xiongying drugknowledgediscoveryviamultitasklearningandpretrainedmodels AT hubaotian drugknowledgediscoveryviamultitasklearningandpretrainedmodels AT tangbuzhou drugknowledgediscoveryviamultitasklearningandpretrainedmodels AT pengweihua drugknowledgediscoveryviamultitasklearningandpretrainedmodels AT chenqingcai drugknowledgediscoveryviamultitasklearningandpretrainedmodels |