
How well does NamSor perform in predicting the country of origin and ethnicity of individuals based on their first and last names?

BACKGROUND: We aimed to evaluate NamSor’s performance in predicting the country of origin and ethnicity of individuals based on their first/last names. METHODS: We retrieved the name and country of affiliation of all authors of PubMed publications in 2021, affiliated with universities in the twenty-...


Bibliographic Details
Main author: Sebo, Paul
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10653483/
https://www.ncbi.nlm.nih.gov/pubmed/37972002
http://dx.doi.org/10.1371/journal.pone.0294562
author Sebo, Paul
collection PubMed
description BACKGROUND: We aimed to evaluate NamSor’s performance in predicting the country of origin and ethnicity of individuals based on their first/last names. METHODS: We retrieved the name and country of affiliation of all authors of PubMed publications in 2021, affiliated with universities in the twenty-two countries whose researchers authored ≥1,000 medical publications and whose percentage of migrants was <2.5% (N = 88,699). We estimated with NamSor their most likely "continent of origin" (Asia/Africa/Europe), "country of origin" and "ethnicity". We also examined two other variables that we created: “continent#2” ("Europe" replaced by "Europe/America/Oceania") and “country#2” ("Spain" replaced by “Spain/Hispanic American country” and "Portugal" replaced by "Portugal/Brazil"). Using "country of affiliation" as a proxy for "country of origin", we calculated for these five variables the proportion of misclassifications (= errorCodedWithoutNA) and the proportion of non-classifications (= naCoded). We repeated the analyses with a subsample consisting of all results with inference accuracy ≥50%. RESULTS: For the full sample and the subsample, errorCodedWithoutNA was 16.0% and 12.6% for “continent”, 6.3% and 3.3% for “continent#2”, 27.3% and 19.5% for “country”, 19.7% and 11.4% for “country#2”, and 20.2% and 14.8% for “ethnicity”; naCoded was zero and 18.0% for all variables, except for “ethnicity” (zero and 10.7%). CONCLUSION: NamSor is accurate in determining the continent of origin, especially when using the modified variable (continent#2) and/or restricting the analysis to names with accuracy ≥50%. The risk of misclassification is higher with country of origin or ethnicity, but decreases, as with continent of origin, when using the modified variable (country#2) and/or the subsample.
format Online
Article
Text
id pubmed-10653483
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10653483 2023-11-16 Sebo, Paul PLoS One Research Article Public Library of Science 2023-11-16 /pmc/articles/PMC10653483/ /pubmed/37972002 http://dx.doi.org/10.1371/journal.pone.0294562 Text en © 2023 Paul Sebo https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title How well does NamSor perform in predicting the country of origin and ethnicity of individuals based on their first and last names?
topic Research Article
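The two metrics named in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the record does not confirm: errorCodedWithoutNA is taken to be the share of wrong predictions among names that received a classification, naCoded is taken to be the share of unclassified names among all names, and the subsample is modelled by treating any prediction whose score is below 0.5 as unclassified. The function names, the `scores` argument, and the sample data are hypothetical and are not part of NamSor's API.

```python
from typing import Optional, Sequence


def apply_threshold(
    labels: Sequence[str], scores: Sequence[float], threshold: float = 0.5
) -> list[Optional[str]]:
    """Mark predictions whose score is below `threshold` as unclassified (None).

    Assumption: `scores` stands in for the per-name inference accuracy.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    return [lab if s >= threshold else None for lab, s in zip(labels, scores)]


def error_rates(
    predicted: Sequence[Optional[str]], reference: Sequence[str]
) -> dict[str, float]:
    """Return errorCodedWithoutNA and naCoded under the assumed definitions."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    total = len(reference)
    unclassified = sum(1 for p in predicted if p is None)
    classified = total - unclassified
    wrong = sum(
        1 for p, r in zip(predicted, reference) if p is not None and p != r
    )
    return {
        # Misclassified names divided by names that were classified at all.
        "errorCodedWithoutNA": wrong / classified if classified else float("nan"),
        # Unclassified names divided by all names.
        "naCoded": unclassified / total if total else float("nan"),
    }


# Hypothetical data: the reference label is the country of affiliation,
# used as a proxy for country of origin as the abstract describes.
reference = ["Japan", "Japan", "Nigeria", "Poland", "Japan"]
labels = ["Japan", "China", "Nigeria", "Poland", "Japan"]
scores = [0.9, 0.4, 0.8, 0.7, 0.6]

full_sample = error_rates(labels, reference)
subsample = error_rates(apply_threshold(labels, scores), reference)
```

With this toy data the only wrong prediction also has a score below 0.5, so in the thresholded run it counts as unclassified rather than as an error. That is the same trade-off the abstract reports: the misclassification rate falls while the non-classification rate rises.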