
Using genderize.io to infer the gender of first names: how to improve the accuracy of the inference

OBJECTIVE: We recently showed that genderize.io is not a sufficiently powerful gender detection tool due to a large number of nonclassifications. In the present study, we aimed to assess whether the accuracy of inference by genderize.io can be improved by manipulating the first names in the database...


Bibliographic Details
Main author:	Sebo, Paul
Format:	Online Article Text
Language:	English
Published:	University Library System, University of Pittsburgh 2021
Subjects:	Original Investigation
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8608220/ https://www.ncbi.nlm.nih.gov/pubmed/34858090 http://dx.doi.org/10.5195/jmla.2021.1252

_version_	1784602709096136704
author	Sebo, Paul
author_facet	Sebo, Paul
author_sort	Sebo, Paul
collection	PubMed
description	OBJECTIVE: We recently showed that genderize.io is not a sufficiently powerful gender detection tool due to a large number of nonclassifications. In the present study, we aimed to assess whether the accuracy of inference by genderize.io can be improved by manipulating the first names in the database. METHODS: We used a database containing the first names, surnames, and gender of 6,131 physicians practicing in a multicultural country (Switzerland). We uploaded the original CSV file (file #1), the file obtained after removing all diacritic marks, such as accents and cedilla (file #2), and the file obtained after removing all diacritic marks and retaining only the first term of the compound first names (file #3). For each file, we computed three performance metrics: proportion of misclassifications (errorCodedWithoutNA), proportion of nonclassifications (naCoded), and proportion of misclassifications and nonclassifications (errorCoded). RESULTS: naCoded, which was high for file #1 (16.4%), was reduced after data manipulation (file #2: 11.7%, file #3: 0.4%). As the increase in the number of misclassifications was small, the overall performance of genderize.io (i.e., errorCoded) improved, especially for file #3 (file #1: 17.7%, file #2: 13.0%, and file #3: 2.3%). CONCLUSIONS: A relatively simple manipulation of the data improved the accuracy of gender inference by genderize.io. We recommend using genderize.io only with files that were modified in this way.
format	Online Article Text
id	pubmed-8608220
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	University Library System, University of Pittsburgh
record_format	MEDLINE/PubMed
spelling	pubmed-8608220 2021-12-01 Using genderize.io to infer the gender of first names: how to improve the accuracy of the inference Sebo, Paul J Med Libr Assoc Original Investigation University Library System, University of Pittsburgh 2021-10-01 2021-10-01 /pmc/articles/PMC8608220/ /pubmed/34858090 http://dx.doi.org/10.5195/jmla.2021.1252 Text en Copyright © 2021 Paul Sebo. This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Original Investigation Sebo, Paul Using genderize.io to infer the gender of first names: how to improve the accuracy of the inference
title	Using genderize.io to infer the gender of first names: how to improve the accuracy of the inference
topic	Original Investigation
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8608220/ https://www.ncbi.nlm.nih.gov/pubmed/34858090 http://dx.doi.org/10.5195/jmla.2021.1252
work_keys_str_mv	AT sebopaul usinggenderizeiotoinferthegenderoffirstnameshowtoimprovetheaccuracyoftheinference
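
The name manipulations and the three performance metrics described in the abstract can be illustrated with a minimal Python sketch. It is not the author's code. The function names (strip_diacritics, first_term, performance_metrics) are hypothetical, the rule for splitting compound first names (on hyphens or whitespace) is an assumption, and so is the denominator of errorCodedWithoutNA (classified names only), since the abstract states neither. No call to genderize.io is made; predictions are passed in as plain values, with None standing for a nonclassification.

```python
import re
import unicodedata


def strip_diacritics(name: str) -> str:
    """Remove combining marks such as accents and cedillas (file #2 step).

    Letters that are not base + combining mark (for example 'ø') are left as-is.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def first_term(name: str) -> str:
    """Keep only the first term of a compound first name (file #3 step).

    Assumption: terms are separated by hyphens or whitespace.
    """
    parts = re.split(r"[\s\-]+", name.strip())
    return parts[0]


def performance_metrics(true_genders, predicted_genders):
    """Compute the three proportions named in the abstract.

    predicted_genders uses None for a name the tool could not classify.
    Assumption: errorCodedWithoutNA is computed over classified names only.
    """
    total = len(true_genders)
    unclassified = sum(1 for p in predicted_genders if p is None)
    wrong = sum(
        1
        for t, p in zip(true_genders, predicted_genders)
        if p is not None and p != t
    )
    classified = total - unclassified
    return {
        "errorCoded": (wrong + unclassified) / total,
        "naCoded": unclassified / total,
        "errorCodedWithoutNA": wrong / classified if classified else 0.0,
    }


# Illustrative values only, not data from the study.
cleaned = first_term(strip_diacritics("François-Xavier"))
metrics = performance_metrics(
    ["male", "female", "female", "male"],
    ["male", "male", None, "male"],
)
```

Applying strip_diacritics alone corresponds to file #2, and applying first_term on top of it corresponds to file #3. With these definitions, errorCoded is the sum of the misclassified and nonclassified counts over all names, which matches the abstract's description of it as the overall performance measure.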
