Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification
Large volumes of publications are being produced in biomedical sciences nowadays with ever-increasing speed. To deal with the large amount of unstructured text data, effective natural language processing (NLP) methods need to be developed for various tasks such as document classification and information extraction.
Main Authors: | Erdengasileng, Arslan; Han, Qing; Zhao, Tingting; Tian, Shubo; Sui, Xin; Li, Keqiao; Wang, Wanjing; Wang, Jian; Hu, Ting; Pan, Feng; Zhang, Yuan; Zhang, Jinfeng |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2022 |
Subjects: | Original Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9375052/ https://www.ncbi.nlm.nih.gov/pubmed/35962559 http://dx.doi.org/10.1093/database/baac066 |
Field | Value |
---|---|
_version_ | 1784767876599644160 |
author | Erdengasileng, Arslan; Han, Qing; Zhao, Tingting; Tian, Shubo; Sui, Xin; Li, Keqiao; Wang, Wanjing; Wang, Jian; Hu, Ting; Pan, Feng; Zhang, Yuan; Zhang, Jinfeng |
author_facet | Erdengasileng, Arslan; Han, Qing; Zhao, Tingting; Tian, Shubo; Sui, Xin; Li, Keqiao; Wang, Wanjing; Wang, Jian; Hu, Ting; Pan, Feng; Zhang, Yuan; Zhang, Jinfeng |
author_sort | Erdengasileng, Arslan |
collection | PubMed |
description | Large volumes of publications are being produced in biomedical sciences nowadays with ever-increasing speed. To deal with the large amount of unstructured text data, effective natural language processing (NLP) methods need to be developed for various tasks such as document classification and information extraction. The BioCreative Challenge was established to evaluate the effectiveness of information extraction methods in the biomedical domain and to facilitate their development as a community-wide effort. In this paper, we summarize our work and what we have learned from the latest round, BioCreative Challenge VII, where we participated in all five tracks. Overall, we found three key components for achieving high performance across a variety of NLP tasks: (1) pre-trained NLP models; (2) data augmentation strategies; and (3) ensemble modelling. These three strategies need to be tailored towards the specific tasks at hand to achieve high-performing baseline models, which are usually good enough for practical applications. When further combined with task-specific methods, additional improvements (usually rather small) can be achieved, which might be critical for winning competitions. Database URL: https://doi.org/10.1093/database/baac066 |
format | Online Article Text |
id | pubmed-9375052 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9375052 2022-08-15 Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification Erdengasileng, Arslan Han, Qing Zhao, Tingting Tian, Shubo Sui, Xin Li, Keqiao Wang, Wanjing Wang, Jian Hu, Ting Pan, Feng Zhang, Yuan Zhang, Jinfeng Database (Oxford) Original Article Large volumes of publications are being produced in biomedical sciences nowadays with ever-increasing speed. To deal with the large amount of unstructured text data, effective natural language processing (NLP) methods need to be developed for various tasks such as document classification and information extraction. The BioCreative Challenge was established to evaluate the effectiveness of information extraction methods in the biomedical domain and to facilitate their development as a community-wide effort. In this paper, we summarize our work and what we have learned from the latest round, BioCreative Challenge VII, where we participated in all five tracks. Overall, we found three key components for achieving high performance across a variety of NLP tasks: (1) pre-trained NLP models; (2) data augmentation strategies; and (3) ensemble modelling. These three strategies need to be tailored towards the specific tasks at hand to achieve high-performing baseline models, which are usually good enough for practical applications. When further combined with task-specific methods, additional improvements (usually rather small) can be achieved, which might be critical for winning competitions. Database URL: https://doi.org/10.1093/database/baac066 Oxford University Press 2022-08-13 /pmc/articles/PMC9375052/ /pubmed/35962559 http://dx.doi.org/10.1093/database/baac066 Text en © The Author(s) 2022. Published by Oxford University Press.
https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Original Article Erdengasileng, Arslan Han, Qing Zhao, Tingting Tian, Shubo Sui, Xin Li, Keqiao Wang, Wanjing Wang, Jian Hu, Ting Pan, Feng Zhang, Yuan Zhang, Jinfeng Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification |
title | Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification |
title_full | Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification |
title_fullStr | Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification |
title_full_unstemmed | Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification |
title_short | Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification |
title_sort | pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9375052/ https://www.ncbi.nlm.nih.gov/pubmed/35962559 http://dx.doi.org/10.1093/database/baac066 |
work_keys_str_mv | AT erdengasilengarslan pretrainedmodelsdataaugmentationandensemblelearningforbiomedicalinformationextractionanddocumentclassification AT hanqing pretrainedmodelsdataaugmentationandensemblelearningforbiomedicalinformationextractionanddocumentclassification AT zhaotingting pretrainedmodelsdataaugmentationandensemblelearningforbiomedicalinformationextractionanddocumentclassification AT tianshubo pretrainedmodelsdataaugmentationandensemblelearningforbiomedicalinformationextractionanddocumentclassification AT suixin pretrainedmodelsdataaugmentationandensemblelearningforbiomedicalinformationextractionanddocumentclassification AT likeqiao pretrainedmodelsdataaugmentationandensemblelearningforbiomedicalinformationextractionanddocumentclassification AT wangwanjing pretrainedmodelsdataaugmentationandensemblelearningforbiomedicalinformationextractionanddocumentclassification AT wangjian pretrainedmodelsdataaugmentationandensemblelearningforbiomedicalinformationextractionanddocumentclassification AT huting pretrainedmodelsdataaugmentationandensemblelearningforbiomedicalinformationextractionanddocumentclassification AT panfeng pretrainedmodelsdataaugmentationandensemblelearningforbiomedicalinformationextractionanddocumentclassification AT zhangyuan pretrainedmodelsdataaugmentationandensemblelearningforbiomedicalinformationextractionanddocumentclassification AT zhangjinfeng pretrainedmodelsdataaugmentationandensemblelearningforbiomedicalinformationextractionanddocumentclassification |
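The abstract's third component, ensemble modelling, combines the predictions of several independently fine-tuned models. The sketch below is a minimal illustration of one common combination rule, majority voting over per-document labels; the model names and labels are hypothetical placeholders, not the ones used in the paper:

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-document label predictions from several models by majority vote.

    predictions_per_model: a list with one inner list per model, each holding
    one predicted label per document (all inner lists the same length).
    """
    n_docs = len(predictions_per_model[0])
    combined = []
    for i in range(n_docs):
        # Count how many models predicted each label for document i,
        # then keep the most frequent label.
        votes = Counter(preds[i] for preds in predictions_per_model)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Hypothetical predictions from three fine-tuned models on four abstracts.
model_a = ["relevant", "irrelevant", "relevant", "relevant"]
model_b = ["relevant", "relevant", "irrelevant", "relevant"]
model_c = ["irrelevant", "relevant", "relevant", "relevant"]

print(majority_vote([model_a, model_b, model_c]))
# → ['relevant', 'relevant', 'relevant', 'relevant']
```

In practice, majority voting is only one option; averaging per-class probabilities before taking the argmax is a frequent alternative when the individual models expose calibrated scores.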