Environmental due diligence data: A novel corpus for training environmental domain NLP models
This article takes a step toward adapting existing Natural Language Processing (NLP) models to the diverse and heterogeneous settings of Environmental Due Diligence (EDD). The approach we followed was to enrich the vocabulary of deep learning models with more data from the environmental domain...
Main authors: | Aman, Afreen; Reji, Deepak John |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2022 |
Subjects: | Data Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9486029/ https://www.ncbi.nlm.nih.gov/pubmed/36148216 http://dx.doi.org/10.1016/j.dib.2022.108579 |
_version_ | 1784792188013510656 |
---|---|
author | Aman, Afreen; Reji, Deepak John |
author_facet | Aman, Afreen; Reji, Deepak John |
author_sort | Aman, Afreen |
collection | PubMed |
description | This article takes a step toward adapting existing Natural Language Processing (NLP) models to the diverse and heterogeneous settings of Environmental Due Diligence (EDD). The approach we followed was to enrich the vocabulary of deep learning models with more data from the environmental domain, collected from open-source regulatory documents provided by the Environmental Protection Agency (EPA) [1]. We used active learning and data augmentation methods to address class imbalance and fine-tuned DistilBERT on EDD data to develop an environmental due diligence model, which is hosted as an inference Application Programming Interface (API) on the Hugging Face Hub. The model is packaged to predict EDD classes, determine relevance and ranking, and allow users to fine-tune it on additional EDD classes. This package, EnvBert, is hosted on the Python Package Index (PyPI) [2]. We anticipate that the rich EDD dataset we used to train the model and create the package will help users with a variety of NLP tasks on EDD textual data, especially text classification. We present the data in raw format; it has been open-sourced and is publicly available at https://data.mendeley.com/datasets/tx6vmd4g9p/4. |
format | Online Article Text |
id | pubmed-9486029 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-9486029 2022-09-21 Environmental due diligence data: A novel corpus for training environmental domain NLP models Aman, Afreen; Reji, Deepak John Data Brief Data Article This article takes a step toward adapting existing Natural Language Processing (NLP) models to the diverse and heterogeneous settings of Environmental Due Diligence (EDD). The approach we followed was to enrich the vocabulary of deep learning models with more data from the environmental domain, collected from open-source regulatory documents provided by the Environmental Protection Agency (EPA) [1]. We used active learning and data augmentation methods to address class imbalance and fine-tuned DistilBERT on EDD data to develop an environmental due diligence model, which is hosted as an inference Application Programming Interface (API) on the Hugging Face Hub. The model is packaged to predict EDD classes, determine relevance and ranking, and allow users to fine-tune it on additional EDD classes. This package, EnvBert, is hosted on the Python Package Index (PyPI) [2]. We anticipate that the rich EDD dataset we used to train the model and create the package will help users with a variety of NLP tasks on EDD textual data, especially text classification. We present the data in raw format; it has been open-sourced and is publicly available at https://data.mendeley.com/datasets/tx6vmd4g9p/4. Elsevier 2022-09-07 /pmc/articles/PMC9486029/ /pubmed/36148216 http://dx.doi.org/10.1016/j.dib.2022.108579 Text en © 2022 The Author(s). Published by Elsevier Inc. https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | Data Article Aman, Afreen Reji, Deepak John Environmental due diligence data: A novel corpus for training environmental domain NLP models |
title | Environmental due diligence data: A novel corpus for training environmental domain NLP models |
title_full | Environmental due diligence data: A novel corpus for training environmental domain NLP models |
title_fullStr | Environmental due diligence data: A novel corpus for training environmental domain NLP models |
title_full_unstemmed | Environmental due diligence data: A novel corpus for training environmental domain NLP models |
title_short | Environmental due diligence data: A novel corpus for training environmental domain NLP models |
title_sort | environmental due diligence data: a novel corpus for training environmental domain nlp models |
topic | Data Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9486029/ https://www.ncbi.nlm.nih.gov/pubmed/36148216 http://dx.doi.org/10.1016/j.dib.2022.108579 |
work_keys_str_mv | AT amanafreen environmentalduediligencedataanovelcorpusfortrainingenvironmentaldomainnlpmodels AT rejideepakjohn environmentalduediligencedataanovelcorpusfortrainingenvironmentaldomainnlpmodels |
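
The description in this record mentions using data augmentation to deal with imbalanced classes before fine-tuning. The record does not say which augmentation method the authors applied, so the following is only a minimal sketch of one common stand-in, random oversampling, written with the Python standard library. The function name `oversample`, the toy sentences, and the labels `contamination` and `permit` are all hypothetical and are not taken from the dataset.

```python
import random
from collections import Counter


def oversample(examples, seed=0):
    """Balance (text, label) pairs by duplicating minority-class examples.

    Hypothetical helper for illustration: every class is padded, by sampling
    with replacement, up to the size of the largest class.
    """
    if not examples:
        return []
    rng = random.Random(seed)

    # Group the examples by their label.
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))

    # The largest class sets the target size for all classes.
    target = max(len(items) for items in by_label.values())

    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Draw the missing examples with replacement (k may be 0).
        balanced.extend(rng.choices(items, k=target - len(items)))

    rng.shuffle(balanced)
    return balanced


# Toy data (assumed, not from the EDD corpus): three examples of one class,
# one example of another.
toy = [
    ("Groundwater sampling found elevated benzene.", "contamination"),
    ("Soil borings showed lead above screening levels.", "contamination"),
    ("Vapor intrusion was detected beneath the slab.", "contamination"),
    ("The site holds a valid discharge permit.", "permit"),
]

balanced = oversample(toy)
counts = Counter(label for _, label in balanced)
print(counts)
```

Oversampling only repeats existing sentences, so it should be applied to the training split alone; duplicating examples before splitting would leak copies into the evaluation data.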