
Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets

Admixture estimation plays a crucial role in ancestry inference and genome-wide association studies (GWASs). Computer programs such as ADMIXTURE and STRUCTURE are commonly employed to estimate the admixture proportions of sample individuals. However, these programs can be overwhelmed by the computat...


Bibliographic Details
Main authors: Ko, Seyoon; Chu, Benjamin B.; Peterson, Daniel; Okenwa, Chidera; Papp, Jeanette C.; Alexander, David H.; Sobel, Eric M.; Zhou, Hua; Lange, Kenneth L.
Format: Online Article Text
Language: English
Published: Elsevier 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9943729/
https://www.ncbi.nlm.nih.gov/pubmed/36610401
http://dx.doi.org/10.1016/j.ajhg.2022.12.008
author Ko, Seyoon
Chu, Benjamin B.
Peterson, Daniel
Okenwa, Chidera
Papp, Jeanette C.
Alexander, David H.
Sobel, Eric M.
Zhou, Hua
Lange, Kenneth L.
collection PubMed
description Admixture estimation plays a crucial role in ancestry inference and genome-wide association studies (GWASs). Computer programs such as ADMIXTURE and STRUCTURE are commonly employed to estimate the admixture proportions of sample individuals. However, these programs can be overwhelmed by the computational burdens imposed by the [Formula: see text] to [Formula: see text] samples and millions of markers commonly found in modern biobanks. An attractive strategy is to run these programs on a set of ancestry-informative SNP markers (AIMs) that exhibit substantially different frequencies across populations. Unfortunately, existing methods for identifying AIMs require knowing ancestry labels for a subset of the sample. This supervised learning approach creates a chicken and the egg scenario. In this paper, we present an unsupervised, scalable framework that seamlessly carries out AIM selection and likelihood-based estimation of admixture proportions. Our simulated and real data examples show that this approach is scalable to modern biobank datasets. OpenADMIXTURE, our Julia implementation of the method, is open source and available for free.
format Online
Article
Text
id pubmed-9943729
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-9943729 2023-02-23 Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets Ko, Seyoon Chu, Benjamin B. Peterson, Daniel Okenwa, Chidera Papp, Jeanette C. Alexander, David H. Sobel, Eric M. Zhou, Hua Lange, Kenneth L. Am J Hum Genet Article Admixture estimation plays a crucial role in ancestry inference and genome-wide association studies (GWASs). Computer programs such as ADMIXTURE and STRUCTURE are commonly employed to estimate the admixture proportions of sample individuals. However, these programs can be overwhelmed by the computational burdens imposed by the [Formula: see text] to [Formula: see text] samples and millions of markers commonly found in modern biobanks. An attractive strategy is to run these programs on a set of ancestry-informative SNP markers (AIMs) that exhibit substantially different frequencies across populations. Unfortunately, existing methods for identifying AIMs require knowing ancestry labels for a subset of the sample. This supervised learning approach creates a chicken and the egg scenario. In this paper, we present an unsupervised, scalable framework that seamlessly carries out AIM selection and likelihood-based estimation of admixture proportions. Our simulated and real data examples show that this approach is scalable to modern biobank datasets. OpenADMIXTURE, our Julia implementation of the method, is open source and available for free. Elsevier 2023-02-02 2023-01-06 /pmc/articles/PMC9943729/ /pubmed/36610401 http://dx.doi.org/10.1016/j.ajhg.2022.12.008 Text en © 2022 The Authors https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9943729/
https://www.ncbi.nlm.nih.gov/pubmed/36610401
http://dx.doi.org/10.1016/j.ajhg.2022.12.008