
ETHNOPRED: a novel machine learning method for accurate continental and sub-continental ancestry identification and population stratification correction

BACKGROUND: Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case–control genome wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-li...


Bibliographic Details
Main authors: Hajiloo, Mohsen; Sapkota, Yadav; Mackey, John R; Robson, Paula; Greiner, Russell; Damaraju, Sambasivarao
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3618021/
https://www.ncbi.nlm.nih.gov/pubmed/23432980
http://dx.doi.org/10.1186/1471-2105-14-61
_version_ 1782265340763308032
author Hajiloo, Mohsen
Sapkota, Yadav
Mackey, John R
Robson, Paula
Greiner, Russell
Damaraju, Sambasivarao
author_sort Hajiloo, Mohsen
collection PubMed
description BACKGROUND: Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case–control genome-wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct population stratification, but each has limitations. We provide an alternative technique to address population stratification.
RESULTS: We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees, capable of accurately predicting an individual’s continental and sub-continental ancestry. To predict an individual’s continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with 10-fold cross-validation accuracy of 100% using the HapMap II dataset. We extended this model to involve 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of ≥ 99.9%, even if some of those 149 SNP values were missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control’s λ from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, respectively involving 31, 502, 526, 242 and 271 SNPs, with 10-fold cross-validation accuracy of 86.5% ± 2.4%, 95.6% ± 3.9%, 95.6% ± 2.1%, 98.3% ± 2.0%, and 95.9% ± 1.5%. However, ETHNOPRED was unable to produce a classifier that can accurately distinguish Chinese in Beijing vs. Chinese in Denver.
CONCLUSIONS: ETHNOPRED is a novel technique for producing classifiers that can identify an individual’s continental and sub-continental heritage, based on a small number of SNPs. We show that its learned classifiers are simple, cost-efficient, accurate, transparent, flexible, fast, applicable to large-scale GWASs, and robust to missing values.
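The abstract does not spell out how an ensemble of disjoint decision trees copes with missing SNP values. The Python sketch below shows one plausible reading, not the authors' implementation: because each tree uses its own SNPs, a missing genotype silences only the tree that needs it, and the remaining trees still vote. The SNP identifiers (snp_a, snp_b, snp_c), the labels (POP1, POP2), the one-split trees, and the majority-vote rule are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Hypothetical encoding: SNP id -> minor-allele count (0, 1 or 2).
Genotype = Dict[str, int]
Tree = Callable[[Genotype], Optional[str]]


def make_stump(snp: str, threshold: int, low: str, high: str) -> Tree:
    """Build a one-split tree on a single SNP (illustrative only)."""
    def tree(genotype: Genotype) -> Optional[str]:
        value = genotype.get(snp)
        if value is None:
            return None  # this tree abstains when its SNP is missing
        return low if value <= threshold else high
    return tree


def predict(trees: List[Tree], genotype: Genotype) -> Optional[str]:
    """Majority vote over the trees that could be evaluated."""
    votes = Counter(v for v in (t(genotype) for t in trees) if v is not None)
    if not votes:
        return None  # no tree had the SNP it needs
    return votes.most_common(1)[0][0]


# Disjoint ensemble: no SNP appears in more than one tree.
ENSEMBLE = [
    make_stump("snp_a", 0, "POP1", "POP2"),
    make_stump("snp_b", 1, "POP1", "POP2"),
    make_stump("snp_c", 0, "POP2", "POP1"),
]

complete = {"snp_a": 0, "snp_b": 1, "snp_c": 2}
partial = {"snp_a": 0, "snp_c": 2}  # snp_b was not genotyped

print(predict(ENSEMBLE, complete))
print(predict(ENSEMBLE, partial))
```

In this toy ensemble the prediction for the partial genotype matches the one for the complete genotype, because the two trees that can still be evaluated agree. That is the intuition behind keeping the trees disjoint; real trees would have more than one split and would be learned from HapMap data.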
format Online
Article
Text
id pubmed-3618021
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3618021 2013-04-10 BMC Bioinformatics Research Article BioMed Central 2013-02-22 /pmc/articles/PMC3618021/ /pubmed/23432980 http://dx.doi.org/10.1186/1471-2105-14-61 Text en Copyright © 2013 Hajiloo et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title ETHNOPRED: a novel machine learning method for accurate continental and sub-continental ancestry identification and population stratification correction
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3618021/
https://www.ncbi.nlm.nih.gov/pubmed/23432980
http://dx.doi.org/10.1186/1471-2105-14-61