
Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification

Large volumes of publications are being produced in the biomedical sciences at an ever-increasing speed. To deal with this large amount of unstructured text data, effective natural language processing (NLP) methods need to be developed for tasks such as document classification and information extraction. The BioCreative Challenge was established to evaluate the effectiveness of information extraction methods in the biomedical domain and to facilitate their development as a community-wide effort. In this paper, we summarize our work and what we have learned from the latest round, BioCreative Challenge VII, where we participated in all five tracks. Overall, we found three key components for achieving high performance across a variety of NLP tasks: (1) pre-trained NLP models; (2) data augmentation strategies; and (3) ensemble modelling. These three strategies need to be tailored to the specific task at hand to achieve high-performing baseline models, which are usually good enough for practical applications. When further combined with task-specific methods, additional improvements (usually rather small) can be achieved, which might be critical for winning competitions. Database URL: https://doi.org/10.1093/database/baac066
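The third component named in the abstract, ensemble modelling, can be made concrete with a minimal sketch: a plain majority vote over the labels predicted by several independently trained classifiers. This is an illustrative assumption, not the authors' actual pipeline; the toy classifiers below are hypothetical stand-ins for fine-tuned pre-trained models.

```python
from collections import Counter
from typing import Callable, Sequence

# A "model" here is any callable mapping a document to a predicted label.
# In the paper's setting these would be fine-tuned pre-trained NLP models;
# the stand-ins below are hypothetical and purely for demonstration.
Model = Callable[[str], str]

def majority_vote(models: Sequence[Model], document: str) -> str:
    """Ensemble by majority vote: each model predicts a label and the
    most common label wins (ties broken by first occurrence)."""
    votes = [model(document) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Toy stand-in classifiers.
def model_a(doc: str) -> str:
    return "chemical" if "acid" in doc else "other"

def model_b(doc: str) -> str:
    return "chemical" if "compound" in doc else "other"

def model_c(doc: str) -> str:
    return "other"

if __name__ == "__main__":
    doc = "Ascorbic acid is a well-studied compound."
    print(majority_vote([model_a, model_b, model_c], doc))  # -> "chemical"
```

Majority voting is only one ensembling choice; averaging predicted probabilities or weighting members by validation performance are common variants of the same idea.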

Bibliographic Details
Main Authors: Erdengasileng, Arslan; Han, Qing; Zhao, Tingting; Tian, Shubo; Sui, Xin; Li, Keqiao; Wang, Wanjing; Wang, Jian; Hu, Ting; Pan, Feng; Zhang, Yuan; Zhang, Jinfeng
Format: Online Article Text
Language: English
Published: Oxford University Press, 13 August 2022
Subjects: Original Article
Online Access:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9375052/
https://www.ncbi.nlm.nih.gov/pubmed/35962559
http://dx.doi.org/10.1093/database/baac066
Record Details
Collection: PubMed
Record ID: pubmed-9375052
Institution: National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Journal: Database (Oxford), Original Article
License: © The Author(s) 2022. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.