
StructHDP: automatic inference of number of clusters and population structure from admixed genotype data

Motivation: Clustering of genotype data is an important way of understanding similarities and differences between populations. A summary of populations through clustering allows us to make inferences about the evolutionary history of the populations. Many methods have been proposed to perform cluste...


Bibliographic Details
Main authors:	Shringarpure, Suyash; Won, Daegun; Xing, Eric P.
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2011
Subjects:	Ismb/Eccb 2011 Proceedings Papers Committee July 17 to July 19, 2011, Vienna, Austria
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3117349/ https://www.ncbi.nlm.nih.gov/pubmed/21685088 http://dx.doi.org/10.1093/bioinformatics/btr242

author	Shringarpure, Suyash Won, Daegun Xing, Eric P.
collection	PubMed
description	Motivation: Clustering of genotype data is an important way of understanding similarities and differences between populations. A summary of populations through clustering allows us to make inferences about the evolutionary history of the populations. Many methods have been proposed to perform clustering on multilocus genotype data. However, most of these methods do not directly address the question of how many clusters the data should be divided into and leave that choice to the user. Methods: We present StructHDP, which is a method for automatically inferring the number of clusters from genotype data in the presence of admixture. Our method is an extension of two existing methods, Structure and Structurama. Using a Hierarchical Dirichlet Process (HDP), we model the presence of admixture of an unknown number of ancestral populations in a given sample of genotype data. We use a Gibbs sampler to perform inference on the resulting model and infer the ancestry proportions and the number of clusters that best explain the data. Results: To demonstrate our method, we simulated data from an island model using the neutral coalescent. Comparing the results of StructHDP with Structurama shows the utility of combining HDPs with the Structure model. We used StructHDP to analyze a dataset of 155 Taita thrush, Turdus helleri, which has been previously analyzed using Structure and Structurama. StructHDP correctly picks the optimal number of populations to cluster the data. The clustering based on the inferred ancestry proportions also agrees with that inferred using Structure for the optimal number of populations. We also analyzed data from 1048 individuals from the Human Genome Diversity project from 53 world populations. We found that the clusters obtained correspond with major geographical divisions of the world, which is in agreement with previous analyses of the dataset. Availability: StructHDP is written in C++. The code will be available for download at http://www.sailing.cs.cmu.edu/structhdp. Contact: suyash@cs.cmu.edu; epxing@cs.cmu.edu
format	Online Article Text
id	pubmed-3117349
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-3117349 2011-06-17 Bioinformatics Oxford University Press 2011-07-01 2011-06-14 /pmc/articles/PMC3117349/ /pubmed/21685088 http://dx.doi.org/10.1093/bioinformatics/btr242 Text en © The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	StructHDP: automatic inference of number of clusters and population structure from admixed genotype data
topic	Ismb/Eccb 2011 Proceedings Papers Committee July 17 to July 19, 2011, Vienna, Austria
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3117349/ https://www.ncbi.nlm.nih.gov/pubmed/21685088 http://dx.doi.org/10.1093/bioinformatics/btr242
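
The description above says that a Hierarchical Dirichlet Process lets the number of ancestral populations be inferred rather than fixed in advance. The following is a minimal sketch of that idea only, not the StructHDP Gibbs sampler and not the authors' C++ code. It draws global population weights by truncated stick-breaking and then draws per-individual ancestry proportions centred on those shared weights. The function names, the concentration parameters `gamma` and `alpha`, the truncation level, and the 0.02 usage threshold are all illustrative assumptions.

```python
import random


def stick_breaking(gamma, truncation, rng):
    """Draw truncated stick-breaking weights (a finite stand-in for GEM(gamma))."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, gamma)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # leftover mass goes to the last component
    return weights


def dirichlet(concentrations, rng):
    """Sample a Dirichlet vector by normalising independent gamma draws."""
    # The small floor keeps every shape parameter strictly positive.
    draws = [rng.gammavariate(max(c, 1e-9), 1.0) for c in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def sample_ancestry(num_individuals, gamma=1.0, alpha=2.0, truncation=20, seed=0):
    """Sample shared population weights and per-individual ancestry proportions.

    Hypothetical helper: gamma, alpha and truncation are assumed values chosen
    for illustration, not settings taken from the paper.
    """
    rng = random.Random(seed)
    beta = stick_breaking(gamma, truncation, rng)  # shared across individuals
    proportions = [
        dirichlet([alpha * b for b in beta], rng) for _ in range(num_individuals)
    ]
    return beta, proportions


def count_used_populations(proportions, threshold=0.02):
    """Count populations that exceed the threshold in at least one individual."""
    num_populations = len(proportions[0])
    return sum(
        any(row[k] > threshold for row in proportions)
        for k in range(num_populations)
    )


if __name__ == "__main__":
    beta, proportions = sample_ancestry(num_individuals=5, seed=1)
    print("populations in use:", count_used_populations(proportions))
```

Because every individual's proportions are drawn around the same global weights, populations are shared across individuals while the number that carries appreciable mass varies from draw to draw. A full analysis would additionally condition on genotype data, which this sketch does not attempt.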


Similar items