RENET2: high-performance full-text gene–disease relation extraction with iterative training data expansion

Relation extraction (RE) is a fundamental task for extracting gene–disease associations from biomedical text. Many state-of-the-art tools have limited capacity, as they can extract gene–disease associations only from single sentences or abstract texts. A few studies have explored extracting gene–dis...

Bibliographic Details
Main Authors: Su, Junhao; Wu, Ye; Ting, Hing-Fung; Lam, Tak-Wah; Luo, Ruibang
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8256824/
https://www.ncbi.nlm.nih.gov/pubmed/34235433
http://dx.doi.org/10.1093/nargab/lqab062
author Su, Junhao
Wu, Ye
Ting, Hing-Fung
Lam, Tak-Wah
Luo, Ruibang
collection PubMed
description Relation extraction (RE) is a fundamental task for extracting gene–disease associations from biomedical text. Many state-of-the-art tools have limited capacity, as they can extract gene–disease associations only from single sentences or abstract texts. A few studies have explored extracting gene–disease associations from full-text articles, but there is substantial room for improvement. In this work, we propose RENET2, a deep learning-based RE method, which implements Section Filtering and ambiguous relations modeling to extract gene–disease associations from full-text articles. We designed a novel iterative training data expansion strategy to build an annotated full-text dataset to resolve the scarcity of labels on full-text articles. In our experiments, RENET2 achieved an F1-score of 72.13% for extracting gene–disease associations from an annotated full-text dataset, which was 27.22, 30.30, 29.24 and 23.87% higher than BeFree, DTMiner, BioBERT and RENET, respectively. We applied RENET2 to (i) ∼1.89M full-text articles from PubMed Central and found ∼3.72M gene–disease associations; and (ii) the LitCovid articles and ranked the top 15 proteins associated with COVID-19, supported by recent articles. RENET2 is an efficient and accurate method for full-text gene–disease association extraction. The source code, manually curated abstract/full-text training data, and results of RENET2 are available on GitHub.
format Online
Article
Text
id pubmed-8256824
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8256824 2021-07-06 NAR Genom Bioinform Methart Oxford University Press 2021-07-05 /pmc/articles/PMC8256824/ /pubmed/34235433 http://dx.doi.org/10.1093/nargab/lqab062 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title RENET2: high-performance full-text gene–disease relation extraction with iterative training data expansion
topic Methart