Dataset for studying gender disparity in English literary texts
Recent discourse has highlighted significant gender disparity in many aspects of economic, social and cultural life. With the advent of advanced tools in Artificial Intelligence (AI) and Natural Language Processing (NLP), there is an opportunity to use computational and digital tools to analyze corp...
Main Authors: | Nagaraj, Akarsh; Kejriwal, Mayank |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2022 |
Subjects: | Data Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8842022/ https://www.ncbi.nlm.nih.gov/pubmed/35198684 http://dx.doi.org/10.1016/j.dib.2022.107905 |
_version_ | 1784650968927830016 |
---|---|
author | Nagaraj, Akarsh Kejriwal, Mayank |
author_facet | Nagaraj, Akarsh Kejriwal, Mayank |
author_sort | Nagaraj, Akarsh |
collection | PubMed |
description | Recent discourse has highlighted significant gender disparity in many aspects of economic, social and cultural life. With the advent of advanced tools in Artificial Intelligence (AI) and Natural Language Processing (NLP), there is an opportunity to use computational and digital tools to analyze corpora, such as copyright-expired literature in the pre-modern period (defined herein as books published approximately between 1800 and 1950) from the Project Gutenberg corpus. Nevertheless, there are challenges in using such tools, especially for maintaining high-enough quality to explore interesting hypotheses. We present a dataset and materials that illustrate how modern processes in NLP can be used on the raw text of more than 3,000 literary texts in Project Gutenberg to (i) extract characters and pronouns from the text with high quality, (ii) disambiguate characters so that they are not overcounted, (iii) detect the gender of each character. Furthermore, we also used manual labeling to determine the genders of authors who have published these texts, and published the labels as part of the dataset to facilitate future digital humanities research. |
format | Online Article Text |
id | pubmed-8842022 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-8842022 2022-02-22 Dataset for studying gender disparity in English literary texts Nagaraj, Akarsh Kejriwal, Mayank Data Brief Data Article Recent discourse has highlighted significant gender disparity in many aspects of economic, social and cultural life. With the advent of advanced tools in Artificial Intelligence (AI) and Natural Language Processing (NLP), there is an opportunity to use computational and digital tools to analyze corpora, such as copyright-expired literature in the pre-modern period (defined herein as books published approximately between 1800 and 1950) from the Project Gutenberg corpus. Nevertheless, there are challenges in using such tools, especially for maintaining high-enough quality to explore interesting hypotheses. We present a dataset and materials that illustrate how modern processes in NLP can be used on the raw text of more than 3,000 literary texts in Project Gutenberg to (i) extract characters and pronouns from the text with high quality, (ii) disambiguate characters so that they are not overcounted, (iii) detect the gender of each character. Furthermore, we also used manual labeling to determine the genders of authors who have published these texts, and published the labels as part of the dataset to facilitate future digital humanities research. Elsevier 2022-02-02 /pmc/articles/PMC8842022/ /pubmed/35198684 http://dx.doi.org/10.1016/j.dib.2022.107905 Text en © 2022 The Author(s). Published by Elsevier Inc. https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Data Article Nagaraj, Akarsh Kejriwal, Mayank Dataset for studying gender disparity in English literary texts |
title | Dataset for studying gender disparity in English literary texts |
title_full | Dataset for studying gender disparity in English literary texts |
title_fullStr | Dataset for studying gender disparity in English literary texts |
title_full_unstemmed | Dataset for studying gender disparity in English literary texts |
title_short | Dataset for studying gender disparity in English literary texts |
title_sort | dataset for studying gender disparity in english literary texts |
topic | Data Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8842022/ https://www.ncbi.nlm.nih.gov/pubmed/35198684 http://dx.doi.org/10.1016/j.dib.2022.107905 |
work_keys_str_mv | AT nagarajakarsh datasetforstudyinggenderdisparityinenglishliterarytexts AT kejriwalmayank datasetforstudyinggenderdisparityinenglishliterarytexts |