Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure

SNP datasets are high-dimensional, often with thousands to millions of SNPs and hundreds to thousands of samples or individuals. Accordingly, PCA graphs are frequently used to provide a low-dimensional visualization in order to display and discover patterns in SNP data from humans, animals, plants,...

Bibliographic Details
Main authors: Gauch, Hugh G., Qian, Sheng, Piepho, Hans-Peter, Zhou, Linda, Chen, Rui
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6581268/
https://www.ncbi.nlm.nih.gov/pubmed/31211811
http://dx.doi.org/10.1371/journal.pone.0218306
_version_ 1783428153420546048
author Gauch, Hugh G.
Qian, Sheng
Piepho, Hans-Peter
Zhou, Linda
Chen, Rui
author_facet Gauch, Hugh G.
Qian, Sheng
Piepho, Hans-Peter
Zhou, Linda
Chen, Rui
author_sort Gauch, Hugh G.
collection PubMed
description SNP datasets are high-dimensional, often with thousands to millions of SNPs and hundreds to thousands of samples or individuals. Accordingly, PCA graphs are frequently used to provide a low-dimensional visualization in order to display and discover patterns in SNP data from humans, animals, plants, and microbes, especially to elucidate population structure. PCA is not a single method that is always done the same way, but rather requires three choices, which we explore as a three-way factorial: two kinds of PCA graphs by three SNP codings by six PCA variants. Our three main recommendations are simple and easily implemented: use PCA biplots, SNP coding 1 for the rare allele and 0 for the common allele, and double-centered PCA (or AMMI1 if main effects are also of interest). We also document contemporary practices by a literature survey of 125 representative articles that apply PCA to SNP data, and find that virtually none implement our recommendations. The ultimate benefit from informed and optimal choices of PCA graph, SNP coding, and PCA variant is expected to be discovery of more biology, and thereby acceleration of medical, agricultural, and other vital applications.
format Online
Article
Text
id pubmed-6581268
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-65812682019-06-28 Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure Gauch, Hugh G. Qian, Sheng Piepho, Hans-Peter Zhou, Linda Chen, Rui PLoS One Research Article SNP datasets are high-dimensional, often with thousands to millions of SNPs and hundreds to thousands of samples or individuals. Accordingly, PCA graphs are frequently used to provide a low-dimensional visualization in order to display and discover patterns in SNP data from humans, animals, plants, and microbes, especially to elucidate population structure. PCA is not a single method that is always done the same way, but rather requires three choices, which we explore as a three-way factorial: two kinds of PCA graphs by three SNP codings by six PCA variants. Our three main recommendations are simple and easily implemented: use PCA biplots, SNP coding 1 for the rare allele and 0 for the common allele, and double-centered PCA (or AMMI1 if main effects are also of interest). We also document contemporary practices by a literature survey of 125 representative articles that apply PCA to SNP data, and find that virtually none implement our recommendations. The ultimate benefit from informed and optimal choices of PCA graph, SNP coding, and PCA variant is expected to be discovery of more biology, and thereby acceleration of medical, agricultural, and other vital applications. Public Library of Science 2019-06-18 /pmc/articles/PMC6581268/ /pubmed/31211811 http://dx.doi.org/10.1371/journal.pone.0218306 Text en © 2019 Gauch et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure
title_full Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure
title_fullStr Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure
title_full_unstemmed Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure
title_short Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure
title_sort consequences of pca graphs, snp codings, and pca variants for elucidating population structure
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6581268/
https://www.ncbi.nlm.nih.gov/pubmed/31211811
http://dx.doi.org/10.1371/journal.pone.0218306
work_keys_str_mv AT gauchhughg consequencesofpcagraphssnpcodingsandpcavariantsforelucidatingpopulationstructure
AT qiansheng consequencesofpcagraphssnpcodingsandpcavariantsforelucidatingpopulationstructure
AT piephohanspeter consequencesofpcagraphssnpcodingsandpcavariantsforelucidatingpopulationstructure
AT zhoulinda consequencesofpcagraphssnpcodingsandpcavariantsforelucidatingpopulationstructure
AT chenrui consequencesofpcagraphssnpcodingsandpcavariantsforelucidatingpopulationstructure
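
Illustrative sketch (not part of the catalogued record): the abstract recommends coding the rare allele as 1, using double-centered PCA, and plotting biplots, but this record gives no formulas. The Python example below is a minimal sketch under stated assumptions: the input is a 0/1 matrix of individuals by SNPs, double-centering is taken to mean subtracting row and column means and adding back the grand mean, and the singular values are split symmetrically between row and column scores so both can share one biplot. The function names and the toy matrix are hypothetical, and the authors' own procedure may differ in its details.

```python
import numpy as np


def code_rare_allele_as_one(snps):
    """Recode a 0/1 matrix (individuals x SNPs) so 1 marks the rarer allele per SNP."""
    snps = np.asarray(snps, dtype=float)
    # A column mean above 0.5 means the allele coded 1 is the common one: flip it.
    flip = snps.mean(axis=0) > 0.5
    recoded = snps.copy()
    recoded[:, flip] = 1.0 - recoded[:, flip]
    return recoded


def double_centered_pca(data, n_components=2):
    """PCA of the residuals left after removing row and column main effects."""
    data = np.asarray(data, dtype=float)
    row_means = data.mean(axis=1, keepdims=True)
    col_means = data.mean(axis=0, keepdims=True)
    grand_mean = data.mean()
    residuals = data - row_means - col_means + grand_mean
    u, s, vt = np.linalg.svd(residuals, full_matrices=False)
    k = n_components
    # Assumption: symmetric scaling, so both sets of scores fit on one biplot.
    root = np.sqrt(s[:k])
    row_scores = u[:, :k] * root      # one point per individual
    col_scores = vt[:k].T * root      # one point per SNP
    return row_scores, col_scores, s


# Hypothetical toy data: 5 individuals (rows) by 4 SNPs (columns).
toy = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
])
recoded = code_rare_allele_as_one(toy)
ind_scores, snp_scores, singular_values = double_centered_pca(recoded)
```

Plotting `ind_scores` and `snp_scores` on the same pair of axes gives a biplot in the sense assumed here; with all components retained, the product of the two score matrices reproduces the double-centered residuals.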