A machine learning approach to predict ethnicity using personal name and census location in Canada

BACKGROUND: Canada is an ethnically-diverse country, yet its lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap but its performance in Ca...

Bibliographic Details
Main Authors: Wong, Kai On; Zaïane, Osmar R.; Davis, Faith G.; Yasui, Yutaka
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673495/
https://www.ncbi.nlm.nih.gov/pubmed/33206667
http://dx.doi.org/10.1371/journal.pone.0241239
_version_ 1783611329329758208
author Wong, Kai On
Zaïane, Osmar R.
Davis, Faith G.
Yasui, Yutaka
author_facet Wong, Kai On
Zaïane, Osmar R.
Davis, Faith G.
Yasui, Yutaka
author_sort Wong, Kai On
collection PubMed
description BACKGROUND: Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study developed a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features. METHODS: Using the 1901 census, multiclass and binary classification machine learning pipelines were developed. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, double metaphones, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, area under the receiver operating characteristic curve, and accuracy. RESULTS: The census had 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classifications for Chinese, French, Italian, Japanese, Russian, and others, the F1 ranged from 68% to 95% (median 87%). The lower performance for English, Irish, and Scottish (F1 range 63–67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models strongly improved the prediction in Aboriginal classification (F1 increased from 50% to 84%).
CONCLUSIONS: The automated machine learning approach using only name and census location features can predict the ethnicity of Canadians with varying performance by specific ethnic categories.
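The abstract names substring features and confusion-matrix metrics without showing how either is computed. The following is a minimal sketch in plain Python, not the authors' pipeline: the function names `char_ngrams` and `binary_metrics`, the `^`/`$` padding markers, and the default substring lengths are all assumptions made for illustration. The area under the ROC curve is left out because it requires predicted scores rather than hard labels.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Count character substrings of a lower-cased name or location string.

    The string is padded with '^' and '$' (an assumption of this sketch) so
    that prefixes and suffixes become distinct features.
    """
    padded = f"^{text.lower().strip()}$"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def binary_metrics(y_true, y_pred):
    """Compute the binary metrics listed in the abstract from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),            # recall for the positive class
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),                    # positive predictive value
        "npv": ratio(tn, tn + fn),                    # negative predictive value
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }
```

In a one-vs-rest setup such as the binary classifications described above, `y_true` would mark membership in a single ethnic category, and the substring counts from name and location strings would be concatenated into one feature vector per person before being passed to a classifier.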
format Online
Article
Text
id pubmed-7673495
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7673495 2020-11-19 A machine learning approach to predict ethnicity using personal name and census location in Canada Wong, Kai On Zaïane, Osmar R. Davis, Faith G. Yasui, Yutaka PLoS One Research Article BACKGROUND: Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study developed a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features. METHODS: Using the 1901 census, multiclass and binary classification machine learning pipelines were developed. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, double metaphones, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, area under the receiver operating characteristic curve, and accuracy. RESULTS: The census had 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classifications for Chinese, French, Italian, Japanese, Russian, and others, the F1 ranged from 68% to 95% (median 87%). The lower performance for English, Irish, and Scottish (F1 range 63–67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models strongly improved the prediction in Aboriginal classification (F1 increased from 50% to 84%). CONCLUSIONS: The automated machine learning approach using only name and census location features can predict the ethnicity of Canadians with varying performance by specific ethnic categories. Public Library of Science 2020-11-18 /pmc/articles/PMC7673495/ /pubmed/33206667 http://dx.doi.org/10.1371/journal.pone.0241239 Text en © 2020 Wong et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Wong, Kai On
Zaïane, Osmar R.
Davis, Faith G.
Yasui, Yutaka
A machine learning approach to predict ethnicity using personal name and census location in Canada
title A machine learning approach to predict ethnicity using personal name and census location in Canada
title_full A machine learning approach to predict ethnicity using personal name and census location in Canada
title_fullStr A machine learning approach to predict ethnicity using personal name and census location in Canada
title_full_unstemmed A machine learning approach to predict ethnicity using personal name and census location in Canada
title_short A machine learning approach to predict ethnicity using personal name and census location in Canada
title_sort machine learning approach to predict ethnicity using personal name and census location in canada
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673495/
https://www.ncbi.nlm.nih.gov/pubmed/33206667
http://dx.doi.org/10.1371/journal.pone.0241239