
Environmental due diligence data: A novel corpus for training environmental domain NLP models

This article takes a step toward adapting existing Natural Language Processing (NLP) models to the diverse and heterogeneous settings of Environmental Due Diligence (EDD). Our approach was to enrich the vocabulary of deep learning models with more data from the environmental domain...


Bibliographic Details
Main authors: Aman, Afreen; Reji, Deepak John
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects: Data Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9486029/
https://www.ncbi.nlm.nih.gov/pubmed/36148216
http://dx.doi.org/10.1016/j.dib.2022.108579
author Aman, Afreen
Reji, Deepak John
collection PubMed
description This article takes a step toward adapting existing Natural Language Processing (NLP) models to the diverse and heterogeneous settings of Environmental Due Diligence (EDD). Our approach was to enrich the vocabulary of deep learning models with additional environmental-domain data collected from open-source regulatory documents provided by the Environmental Protection Agency (EPA) [1]. We used active learning and data augmentation methods to address class imbalance, and fine-tuned DistilBERT on EDD data to develop an environmental due diligence model, which is hosted as an inference Application Programming Interface (API) on the Hugging Face Hub. The model was packaged to predict EDD classes, determine relevancy and ranking, and allow users to fine-tune it on additional EDD classes. This package, EnvBert, is hosted on the Python Package Index (PyPI) repository [2]. We anticipate that the rich EDD dataset used to train the model and create the package will help users contribute to a variety of NLP tasks on EDD textual data, especially text classification. We present the data in raw format; it has been open-sourced and is publicly available at https://data.mendeley.com/datasets/tx6vmd4g9p/4.
format Online
Article
Text
id pubmed-9486029
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-9486029 2022-09-21 Environmental due diligence data: A novel corpus for training environmental domain NLP models. Aman, Afreen; Reji, Deepak John. Data Brief. Data Article. Elsevier 2022-09-07 /pmc/articles/PMC9486029/ /pubmed/36148216 http://dx.doi.org/10.1016/j.dib.2022.108579 Text en © 2022 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title Environmental due diligence data: A novel corpus for training environmental domain NLP models
topic Data Article