
Improving text mining in plant health domain with GAN and/or pre-trained language model

The Bidirectional Encoder Representations from Transformers (BERT) architecture offers a cutting-edge approach to Natural Language Processing. It involves two steps: 1) pre-training a language model to extract contextualized features and 2) fine-tuning for specific downstream tasks. Although pre-trained language models (PLMs) have been successful in various text-mining applications, challenges remain, particularly in areas with limited labeled data such as plant health hazard detection from individuals' observations. To address this challenge, we propose to combine GAN-BERT, a model that extends the fine-tuning process with unlabeled data through a Generative Adversarial Network (GAN), with ChouBERT, a domain-specific PLM. Our results show that GAN-BERT outperforms traditional fine-tuning in multiple text classification tasks. In this paper, we examine the impact of further pre-training on the GAN-BERT model. We experiment with different hyperparameters to determine the best combination of models and fine-tuning parameters. Our findings suggest that the combination of GAN and ChouBERT can enhance the generalizability of the text classifier but may also lead to increased instability during training. Finally, we provide recommendations to mitigate these instabilities.


Bibliographic Details
Main Authors: Jiang, Shufan, Cormier, Stéphane, Angarita, Rafael, Rousseaux, Francis
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects: Artificial Intelligence
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9989305/
https://www.ncbi.nlm.nih.gov/pubmed/36895200
http://dx.doi.org/10.3389/frai.2023.1072329
_version_ 1784901742928855040
author Jiang, Shufan
Cormier, Stéphane
Angarita, Rafael
Rousseaux, Francis
author_facet Jiang, Shufan
Cormier, Stéphane
Angarita, Rafael
Rousseaux, Francis
author_sort Jiang, Shufan
collection PubMed
description The Bidirectional Encoder Representations from Transformers (BERT) architecture offers a cutting-edge approach to Natural Language Processing. It involves two steps: 1) pre-training a language model to extract contextualized features and 2) fine-tuning for specific downstream tasks. Although pre-trained language models (PLMs) have been successful in various text-mining applications, challenges remain, particularly in areas with limited labeled data such as plant health hazard detection from individuals' observations. To address this challenge, we propose to combine GAN-BERT, a model that extends the fine-tuning process with unlabeled data through a Generative Adversarial Network (GAN), with ChouBERT, a domain-specific PLM. Our results show that GAN-BERT outperforms traditional fine-tuning in multiple text classification tasks. In this paper, we examine the impact of further pre-training on the GAN-BERT model. We experiment with different hyperparameters to determine the best combination of models and fine-tuning parameters. Our findings suggest that the combination of GAN and ChouBERT can enhance the generalizability of the text classifier but may also lead to increased instability during training. Finally, we provide recommendations to mitigate these instabilities.
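As context for the approach the abstract describes, below is a minimal sketch of GAN-BERT-style semi-supervised fine-tuning in PyTorch. It is not the authors' implementation: the encoder checkpoint, layer sizes, class count, and loss weighting are illustrative assumptions (the ChouBERT checkpoint is not named in this record, so a generic multilingual BERT stands in).

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_CLASSES = 2    # k real task labels (e.g., hazard vs. not); illustrative
NOISE_DIM = 100    # generator noise size; an assumption, not from this record
HIDDEN = 768       # hidden size of a BERT-base encoder

class Generator(nn.Module):
    # Maps random noise to fake "BERT-like" sentence representations.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM, HIDDEN), nn.LeakyReLU(0.2),
            nn.Linear(HIDDEN, HIDDEN),
        )
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    # Scores a representation over k real classes plus one extra "fake" class.
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(HIDDEN, HIDDEN), nn.LeakyReLU(0.2), nn.Dropout(0.1),
        )
        self.head = nn.Linear(HIDDEN, NUM_CLASSES + 1)
    def forward(self, rep):
        return self.head(self.body(rep))

# Placeholder encoder: the paper pairs GAN-BERT with the domain-specific
# ChouBERT, whose checkpoint is not given here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
G, D = Generator(), Discriminator()

def discriminator_logits(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    rep = encoder(**batch).last_hidden_state[:, 0]  # [CLS] representation
    return D(rep)

def d_loss(labeled_texts, labels, unlabeled_texts):
    # Supervised cross-entropy over the k real classes for labeled examples.
    sup = nn.functional.cross_entropy(
        discriminator_logits(labeled_texts)[:, :NUM_CLASSES], labels)
    # Unlabeled real texts should not be classified as fake...
    p_unl = torch.softmax(discriminator_logits(unlabeled_texts), dim=-1)
    real_term = -torch.log(1.0 - p_unl[:, -1] + 1e-8).mean()
    # ...while generator outputs should land in the extra "fake" class.
    p_fake = torch.softmax(D(G(torch.randn(len(unlabeled_texts), NOISE_DIM))), dim=-1)
    fake_term = -torch.log(p_fake[:, -1] + 1e-8).mean()
    return sup + real_term + fake_term

In full training the generator gets its own adversarial loss and is updated alternately with the discriminator and encoder; the training instability the abstract mentions typically shows up in such setups as diverging generator and discriminator losses.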
format Online
Article
Text
id pubmed-9989305
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9989305 2023-03-08 Improving text mining in plant health domain with GAN and/or pre-trained language model Jiang, Shufan Cormier, Stéphane Angarita, Rafael Rousseaux, Francis Front Artif Intell Artificial Intelligence The Bidirectional Encoder Representations from Transformers (BERT) architecture offers a cutting-edge approach to Natural Language Processing. It involves two steps: 1) pre-training a language model to extract contextualized features and 2) fine-tuning for specific downstream tasks. Although pre-trained language models (PLMs) have been successful in various text-mining applications, challenges remain, particularly in areas with limited labeled data such as plant health hazard detection from individuals' observations. To address this challenge, we propose to combine GAN-BERT, a model that extends the fine-tuning process with unlabeled data through a Generative Adversarial Network (GAN), with ChouBERT, a domain-specific PLM. Our results show that GAN-BERT outperforms traditional fine-tuning in multiple text classification tasks. In this paper, we examine the impact of further pre-training on the GAN-BERT model. We experiment with different hyperparameters to determine the best combination of models and fine-tuning parameters. Our findings suggest that the combination of GAN and ChouBERT can enhance the generalizability of the text classifier but may also lead to increased instability during training. Finally, we provide recommendations to mitigate these instabilities. Frontiers Media S.A. 2023-02-21 /pmc/articles/PMC9989305/ /pubmed/36895200 http://dx.doi.org/10.3389/frai.2023.1072329 Text en Copyright © 2023 Jiang, Cormier, Angarita and Rousseaux. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Artificial Intelligence
Jiang, Shufan
Cormier, Stéphane
Angarita, Rafael
Rousseaux, Francis
Improving text mining in plant health domain with GAN and/or pre-trained language model
title Improving text mining in plant health domain with GAN and/or pre-trained language model
title_full Improving text mining in plant health domain with GAN and/or pre-trained language model
title_fullStr Improving text mining in plant health domain with GAN and/or pre-trained language model
title_full_unstemmed Improving text mining in plant health domain with GAN and/or pre-trained language model
title_short Improving text mining in plant health domain with GAN and/or pre-trained language model
title_sort improving text mining in plant health domain with gan and/or pre-trained language model
topic Artificial Intelligence
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9989305/
https://www.ncbi.nlm.nih.gov/pubmed/36895200
http://dx.doi.org/10.3389/frai.2023.1072329
work_keys_str_mv AT jiangshufan improvingtextmininginplanthealthdomainwithganandorpretrainedlanguagemodel
AT cormierstephane improvingtextmininginplanthealthdomainwithganandorpretrainedlanguagemodel
AT angaritarafael improvingtextmininginplanthealthdomainwithganandorpretrainedlanguagemodel
AT rousseauxfrancis improvingtextmininginplanthealthdomainwithganandorpretrainedlanguagemodel