PCA-based population structure inference with generic clustering algorithms
BACKGROUND: Handling genotype data typed at hundreds of thousands of loci is very time-consuming, and population structure inference is no exception. Therefore, we propose to apply PCA to the genotype data of a population, select the significant principal components using the Tracy-Widom distr...
Main Authors: | Lee, Chih; Abdool, Ali; Huang, Chun-Hsi |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2009 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648762/ https://www.ncbi.nlm.nih.gov/pubmed/19208178 http://dx.doi.org/10.1186/1471-2105-10-S1-S73 |
_version_ | 1782164982260039680 |
---|---|
author | Lee, Chih Abdool, Ali Huang, Chun-Hsi |
author_facet | Lee, Chih Abdool, Ali Huang, Chun-Hsi |
author_sort | Lee, Chih |
collection | PubMed |
description | BACKGROUND: Handling genotype data typed at hundreds of thousands of loci is very time-consuming, and population structure inference is no exception. Therefore, we propose to apply PCA to the genotype data of a population, select the significant principal components using the Tracy-Widom distribution, and assign the individuals to one or more subpopulations using generic clustering algorithms. RESULTS: We investigated K-means, soft K-means and spectral clustering and compared them with STRUCTURE, a model-based algorithm specifically designed for population structure inference. Moreover, we investigated methods for predicting the number of subpopulations in a population. The results on four simulated datasets and two real datasets indicate that our approach performs comparably to STRUCTURE. For the simulated datasets, STRUCTURE and soft K-means with BIC produced identical predictions of the number of subpopulations. We also showed that, for the real datasets, BIC is a better index than likelihood for predicting the number of subpopulations. CONCLUSION: Our approach has the advantage of being fast and scalable, while STRUCTURE is very time-consuming because it relies on MCMC for parameter estimation. Therefore, we suggest choosing the algorithm according to the intended application of population structure inference. |
format | Text |
id | pubmed-2648762 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2009 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2648762 2009-03-03 PCA-based population structure inference with generic clustering algorithms Lee, Chih Abdool, Ali Huang, Chun-Hsi BMC Bioinformatics Research BACKGROUND: Handling genotype data typed at hundreds of thousands of loci is very time-consuming, and population structure inference is no exception. Therefore, we propose to apply PCA to the genotype data of a population, select the significant principal components using the Tracy-Widom distribution, and assign the individuals to one or more subpopulations using generic clustering algorithms. RESULTS: We investigated K-means, soft K-means and spectral clustering and compared them with STRUCTURE, a model-based algorithm specifically designed for population structure inference. Moreover, we investigated methods for predicting the number of subpopulations in a population. The results on four simulated datasets and two real datasets indicate that our approach performs comparably to STRUCTURE. For the simulated datasets, STRUCTURE and soft K-means with BIC produced identical predictions of the number of subpopulations. We also showed that, for the real datasets, BIC is a better index than likelihood for predicting the number of subpopulations. CONCLUSION: Our approach has the advantage of being fast and scalable, while STRUCTURE is very time-consuming because it relies on MCMC for parameter estimation. Therefore, we suggest choosing the algorithm according to the intended application of population structure inference. BioMed Central 2009-01-30 /pmc/articles/PMC2648762/ /pubmed/19208178 http://dx.doi.org/10.1186/1471-2105-10-S1-S73 Text en Copyright © 2009 Lee et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Lee, Chih Abdool, Ali Huang, Chun-Hsi PCA-based population structure inference with generic clustering algorithms |
title | PCA-based population structure inference with generic clustering algorithms |
title_full | PCA-based population structure inference with generic clustering algorithms |
title_fullStr | PCA-based population structure inference with generic clustering algorithms |
title_full_unstemmed | PCA-based population structure inference with generic clustering algorithms |
title_short | PCA-based population structure inference with generic clustering algorithms |
title_sort | pca-based population structure inference with generic clustering algorithms |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648762/ https://www.ncbi.nlm.nih.gov/pubmed/19208178 http://dx.doi.org/10.1186/1471-2105-10-S1-S73 |
work_keys_str_mv | AT leechih pcabasedpopulationstructureinferencewithgenericclusteringalgorithms AT abdoolali pcabasedpopulationstructureinferencewithgenericclusteringalgorithms AT huangchunhsi pcabasedpopulationstructureinferencewithgenericclusteringalgorithms |
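
The approach summarized in the description field (PCA on the genotype matrix, then a generic clustering algorithm on the retained components) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `pca_scores`, `assign`, and `kmeans`, the simulated data, and all parameter values are assumptions made for illustration. In particular, the number of retained components is passed in by hand here instead of being selected with the Tracy-Widom distribution, and the number of clusters is fixed instead of being predicted with BIC.

```python
import numpy as np


def pca_scores(genotypes, n_components):
    """Project individuals (rows) onto the top principal components."""
    # Center each locus (column) so the SVD captures variance, not the mean.
    centered = genotypes - genotypes.mean(axis=0)
    # Thin SVD: columns of u scaled by singular values are the PC scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def assign(points, centers):
    """Return the index of the nearest center for every point."""
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's K-means with random initial centers (hard assignment)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centers)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(points, centers), centers


# Hypothetical toy data: two subpopulations with independent allele
# frequencies, genotypes coded as 0/1/2 copies of one allele per locus.
rng = np.random.default_rng(42)
n_per_pop, n_loci = 30, 200
freq_a = rng.uniform(0.1, 0.9, size=n_loci)
freq_b = rng.uniform(0.1, 0.9, size=n_loci)
pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_loci))
pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_loci))
genotypes = np.vstack([pop_a, pop_b]).astype(float)

# One retained component and two clusters, both chosen by hand for this demo.
scores = pca_scores(genotypes, n_components=1)
labels, centers = kmeans(scores, k=2)
print(labels)
```

In this toy setting the first 30 rows and the last 30 rows come from different simulated subpopulations, so the printed labels can be compared against that known split. Clustering on a handful of components instead of on all loci is what keeps the per-iteration cost low, which is consistent with the speed argument made in the description.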