Study of large and highly stratified population datasets by combining iterative pruning principal component analysis and structure

BACKGROUND: The ever increasing sizes of population genetic datasets pose great challenges for population structure analysis. The Tracy-Widom (TW) statistical test is widely used for detecting structure. However, it has not been adequately investigated whether the TW statistic is susceptible to type...

Bibliographic Details
Main authors: Limpiti, Tulaya, Intarapanich, Apichart, Assawamakin, Anunchai, Shaw, Philip J, Wangkumhang, Pongsakorn, Piriyapongsa, Jittima, Ngamphiw, Chumpol, Tongsima, Sissades
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3148578/
https://www.ncbi.nlm.nih.gov/pubmed/21699684
http://dx.doi.org/10.1186/1471-2105-12-255
author Limpiti, Tulaya
Intarapanich, Apichart
Assawamakin, Anunchai
Shaw, Philip J
Wangkumhang, Pongsakorn
Piriyapongsa, Jittima
Ngamphiw, Chumpol
Tongsima, Sissades
collection PubMed
description BACKGROUND: The ever increasing sizes of population genetic datasets pose great challenges for population structure analysis. The Tracy-Widom (TW) statistical test is widely used for detecting structure. However, it has not been adequately investigated whether the TW statistic is susceptible to type I error, especially in large, complex datasets. Non-parametric, Principal Component Analysis (PCA) based methods for resolving structure have been developed which rely on the TW test. Although PCA-based methods can resolve structure, they cannot infer ancestry. Model-based methods are still needed for ancestry analysis, but they are not suitable for large datasets. We propose a new structure analysis framework for large datasets. This includes a new heuristic for detecting structure and incorporation of the structure patterns inferred by a PCA method to complement STRUCTURE analysis. RESULTS: A new heuristic called EigenDev for detecting population structure is presented. When tested on simulated data, this heuristic is robust to sample size. In contrast, the TW statistic was found to be susceptible to type I error, especially for large population samples. EigenDev is thus better-suited for analysis of large datasets containing many individuals, in which spurious patterns are likely to exist and could be incorrectly interpreted as population stratification. EigenDev was applied to the iterative pruning PCA (ipPCA) method, which resolves the underlying subpopulations. This subpopulation information was used to supervise STRUCTURE analysis to infer patterns of ancestry at an unprecedented level of resolution. To validate the new approach, a bovine and a large human genetic dataset (3945 individuals) were analyzed. We found new ancestry patterns consistent with the subpopulations resolved by ipPCA. CONCLUSIONS: The EigenDev heuristic is robust to sampling and is thus superior for detecting structure in large datasets. The application of EigenDev to the ipPCA algorithm improves the estimation of the number of subpopulations and the individual assignment accuracy, especially for very large and complex datasets. Furthermore, we have demonstrated that the structure resolved by this approach complements parametric analysis, allowing a much more comprehensive account of population structure. The new version of the ipPCA software with EigenDev incorporated can be downloaded from http://www4a.biotec.or.th/GI/tools/ippca.
format Online
Article
Text
id pubmed-3148578
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3148578 2011-08-03 Study of large and highly stratified population datasets by combining iterative pruning principal component analysis and structure Limpiti, Tulaya Intarapanich, Apichart Assawamakin, Anunchai Shaw, Philip J Wangkumhang, Pongsakorn Piriyapongsa, Jittima Ngamphiw, Chumpol Tongsima, Sissades BMC Bioinformatics Methodology Article BACKGROUND: The ever increasing sizes of population genetic datasets pose great challenges for population structure analysis. The Tracy-Widom (TW) statistical test is widely used for detecting structure. However, it has not been adequately investigated whether the TW statistic is susceptible to type I error, especially in large, complex datasets. Non-parametric, Principal Component Analysis (PCA) based methods for resolving structure have been developed which rely on the TW test. Although PCA-based methods can resolve structure, they cannot infer ancestry. Model-based methods are still needed for ancestry analysis, but they are not suitable for large datasets. We propose a new structure analysis framework for large datasets. This includes a new heuristic for detecting structure and incorporation of the structure patterns inferred by a PCA method to complement STRUCTURE analysis. RESULTS: A new heuristic called EigenDev for detecting population structure is presented. When tested on simulated data, this heuristic is robust to sample size. In contrast, the TW statistic was found to be susceptible to type I error, especially for large population samples. EigenDev is thus better-suited for analysis of large datasets containing many individuals, in which spurious patterns are likely to exist and could be incorrectly interpreted as population stratification. EigenDev was applied to the iterative pruning PCA (ipPCA) method, which resolves the underlying subpopulations. This subpopulation information was used to supervise STRUCTURE analysis to infer patterns of ancestry at an unprecedented level of resolution. To validate the new approach, a bovine and a large human genetic dataset (3945 individuals) were analyzed. We found new ancestry patterns consistent with the subpopulations resolved by ipPCA. CONCLUSIONS: The EigenDev heuristic is robust to sampling and is thus superior for detecting structure in large datasets. The application of EigenDev to the ipPCA algorithm improves the estimation of the number of subpopulations and the individual assignment accuracy, especially for very large and complex datasets. Furthermore, we have demonstrated that the structure resolved by this approach complements parametric analysis, allowing a much more comprehensive account of population structure. The new version of the ipPCA software with EigenDev incorporated can be downloaded from http://www4a.biotec.or.th/GI/tools/ippca. BioMed Central 2011-06-23 /pmc/articles/PMC3148578/ /pubmed/21699684 http://dx.doi.org/10.1186/1471-2105-12-255 Text en Copyright ©2011 Limpiti et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Study of large and highly stratified population datasets by combining iterative pruning principal component analysis and structure
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3148578/
https://www.ncbi.nlm.nih.gov/pubmed/21699684
http://dx.doi.org/10.1186/1471-2105-12-255