Dataset for studying gender disparity in English literary texts

Bibliographic Details
Main Authors: Nagaraj, Akarsh; Kejriwal, Mayank
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8842022/
https://www.ncbi.nlm.nih.gov/pubmed/35198684
http://dx.doi.org/10.1016/j.dib.2022.107905
Description
Summary: Recent discourse has highlighted significant gender disparity in many aspects of economic, social and cultural life. With the advent of advanced tools in Artificial Intelligence (AI) and Natural Language Processing (NLP), there is an opportunity to use computational and digital tools to analyze corpora, such as copyright-expired literature in the pre-modern period (defined herein as books published approximately between 1800 and 1950) from the Project Gutenberg corpus. Nevertheless, there are challenges in using such tools, especially in maintaining sufficiently high quality to explore interesting hypotheses. We present a dataset and materials that illustrate how modern processes in NLP can be used on the raw text of more than 3,000 literary texts in Project Gutenberg to (i) extract characters and pronouns from the text with high quality, (ii) disambiguate characters so that they are not overcounted, and (iii) detect the gender of each character. Furthermore, we also used manual labeling to determine the genders of the authors who published these texts, and published the labels as part of the dataset to facilitate future digital humanities research.
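
To make step (iii) concrete, the following is a minimal sketch of one simple way a character's gender might be guessed from nearby pronouns. It is not the authors' pipeline, which this record does not describe in detail. The function name `guess_gender`, the pronoun lists, and the 10-token window are all assumptions chosen for illustration, and the sketch assumes the character name has already been extracted and disambiguated (steps i and ii).

```python
import re
from collections import Counter

# Hypothetical pronoun lists; a real system would need a richer inventory.
FEMALE = {"she", "her", "hers", "herself"}
MALE = {"he", "him", "his", "himself"}


def guess_gender(text, name, window=10):
    """Guess a gender label for `name` by counting gendered pronouns in the
    `window` tokens that follow each exact mention of the name."""
    tokens = re.findall(r"[A-Za-z']+", text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != name:
            continue
        # Look only at the tokens immediately after this mention.
        for nxt in tokens[i + 1 : i + 1 + window]:
            low = nxt.lower()
            if low in FEMALE:
                counts["female"] += 1
            elif low in MALE:
                counts["male"] += 1
    if counts["female"] > counts["male"]:
        return "female"
    if counts["male"] > counts["female"]:
        return "male"
    return "unknown"  # tie, or no pronouns found near the name


sample = ("Elizabeth smiled as she took her seat. "
          "Darcy bowed, and he left with his hat.")
elizabeth_label = guess_gender(sample, "Elizabeth")
darcy_label = guess_gender(sample, "Darcy")
```

A window-based count like this is easy to fool, for example when two characters of different genders are mentioned close together, which is one reason the summary stresses the difficulty of maintaining quality.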