Dataset for studying gender disparity in English literary texts

Recent discourse has highlighted significant gender disparity in many aspects of economic, social and cultural life. With the advent of advanced tools in Artificial Intelligence (AI) and Natural Language Processing (NLP), there is an opportunity to use computational and digital tools to analyze corp...

Bibliographic Details
Main authors: Nagaraj, Akarsh; Kejriwal, Mayank
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8842022/
https://www.ncbi.nlm.nih.gov/pubmed/35198684
http://dx.doi.org/10.1016/j.dib.2022.107905
author Nagaraj, Akarsh
Kejriwal, Mayank
collection PubMed
description Recent discourse has highlighted significant gender disparity in many aspects of economic, social and cultural life. With the advent of advanced tools in Artificial Intelligence (AI) and Natural Language Processing (NLP), there is an opportunity to use computational and digital tools to analyze corpora, such as copyright-expired literature in the pre-modern period (defined herein as books published approximately between 1800 and 1950) from the Project Gutenberg corpus. Nevertheless, there are challenges in using such tools, especially in maintaining quality high enough to explore interesting hypotheses. We present a dataset and materials that illustrate how modern processes in NLP can be used on the raw text of more than 3,000 literary texts in Project Gutenberg to (i) extract characters and pronouns from the text with high quality, (ii) disambiguate characters so that they are not overcounted, and (iii) detect the gender of each character. Furthermore, we used manual labeling to determine the genders of the authors who published these texts, and published the labels as part of the dataset to facilitate future digital humanities research.
format Online
Article
Text
id pubmed-8842022
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-8842022 2022-02-22 Dataset for studying gender disparity in English literary texts Nagaraj, Akarsh Kejriwal, Mayank Data Brief Data Article Elsevier 2022-02-02 /pmc/articles/PMC8842022/ /pubmed/35198684 http://dx.doi.org/10.1016/j.dib.2022.107905 Text en © 2022 The Author(s). Published by Elsevier Inc. https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Dataset for studying gender disparity in English literary texts
topic Data Article