Manifold Learning for Human Population Structure Studies
The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the “intrinsic dimensionality” of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE) which proj...
Main authors: | Siu, Hoicheong; Jin, Li; Xiong, Momiao |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2012 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3260176/ https://www.ncbi.nlm.nih.gov/pubmed/22272259 http://dx.doi.org/10.1371/journal.pone.0029901 |
_version_ | 1782221451670061056 |
---|---|
author | Siu, Hoicheong Jin, Li Xiong, Momiao |
author_facet | Siu, Hoicheong Jin, Li Xiong, Momiao |
author_sort | Siu, Hoicheong |
collection | PubMed |
description | The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the “intrinsic dimensionality” of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE), which projects high-dimensional genomic data into a low-dimensional, neighborhood-preserving embedding, as a general framework for population structure and historical inference. To facilitate application of the LLE to population genetic analysis, we systematically investigate several important properties of the LLE and reveal the connection between the LLE and principal component analysis (PCA). Identifying a set of markers and genomic regions which could be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying the LLE-correlated or PCA-correlated structure-informative markers, we have developed a new statistic that integrates genomic information content in a genomic region for collectively studying its association with the population structure, and a LASSO algorithm to search for such regions across the genomes. We applied the developed methodologies to a low-coverage pilot dataset in the 1000 Genomes Project and a Phase III Mexico dataset of the HapMap. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were LLE-correlated markers in CEU, YRI and ASI, respectively. This showed that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants. The preliminary results demonstrated that next-generation sequencing offers a rich resource and LLE provides a powerful tool for population structure analysis. |
format | Online Article Text |
id | pubmed-3260176 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2012 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-32601762012-01-23 Manifold Learning for Human Population Structure Studies Siu, Hoicheong Jin, Li Xiong, Momiao PLoS One Research Article The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the “intrinsic dimensionality” of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE), which projects high-dimensional genomic data into a low-dimensional, neighborhood-preserving embedding, as a general framework for population structure and historical inference. To facilitate application of the LLE to population genetic analysis, we systematically investigate several important properties of the LLE and reveal the connection between the LLE and principal component analysis (PCA). Identifying a set of markers and genomic regions which could be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying the LLE-correlated or PCA-correlated structure-informative markers, we have developed a new statistic that integrates genomic information content in a genomic region for collectively studying its association with the population structure, and a LASSO algorithm to search for such regions across the genomes. We applied the developed methodologies to a low-coverage pilot dataset in the 1000 Genomes Project and a Phase III Mexico dataset of the HapMap. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were LLE-correlated markers in CEU, YRI and ASI, respectively. This showed that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants. The preliminary results demonstrated that next-generation sequencing offers a rich resource and LLE provides a powerful tool for population structure analysis. Public Library of Science 2012-01-17 /pmc/articles/PMC3260176/ /pubmed/22272259 http://dx.doi.org/10.1371/journal.pone.0029901 Text en Siu et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
spellingShingle | Research Article Siu, Hoicheong Jin, Li Xiong, Momiao Manifold Learning for Human Population Structure Studies |
title | Manifold Learning for Human Population Structure Studies |
title_full | Manifold Learning for Human Population Structure Studies |
title_fullStr | Manifold Learning for Human Population Structure Studies |
title_full_unstemmed | Manifold Learning for Human Population Structure Studies |
title_short | Manifold Learning for Human Population Structure Studies |
title_sort | manifold learning for human population structure studies |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3260176/ https://www.ncbi.nlm.nih.gov/pubmed/22272259 http://dx.doi.org/10.1371/journal.pone.0029901 |
work_keys_str_mv | AT siuhoicheong manifoldlearningforhumanpopulationstructurestudies AT jinli manifoldlearningforhumanpopulationstructurestudies AT xiongmomiao manifoldlearningforhumanpopulationstructurestudies |
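
The description in this record says that LLE projects high-dimensional genomic data into a low-dimensional, neighborhood-preserving embedding. As a rough illustration of that idea only, the sketch below implements the textbook form of LLE with NumPy on a made-up genotype matrix. It is not the authors' code: the function name `lle`, its parameters, the regularization constant, and the toy allele frequencies are all assumptions chosen for the example.

```python
import numpy as np


def lle(X, n_neighbors=5, n_components=2, reg=1e-3):
    """Basic locally linear embedding of the rows of X (samples x markers).

    Illustrative sketch only; names and defaults are assumptions.
    """
    n = X.shape[0]

    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a sample is never its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :n_neighbors]

    # Step 1: weights that best reconstruct each sample from its neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]      # neighbors centered on sample i
        G = Z @ Z.T                     # local Gram matrix (k x k)
        # Regularize so the system stays solvable when k exceeds the
        # local dimensionality or neighbors coincide.
        G += np.eye(n_neighbors) * reg * max(np.trace(G), 1e-12)
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()  # weights sum to one

    # Step 2: low-dimensional coordinates that preserve those weights.
    I_W = np.eye(n) - W
    M = I_W.T @ I_W
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    # The smallest eigenvalue is ~0 (constant eigenvector when the
    # neighbor graph is connected); drop it and keep the next ones.
    return eigvecs[:, 1:n_components + 1], W


# Toy genotypes coded 0/1/2 for two hypothetical groups of 20 samples
# whose allele frequencies differ at 50 markers (made-up numbers).
rng = np.random.default_rng(0)
group_a = rng.binomial(2, 0.2, size=(20, 50))
group_b = rng.binomial(2, 0.6, size=(20, 50))
X = np.vstack([group_a, group_b]).astype(float)

Y, W = lle(X, n_neighbors=5, n_components=2)
```

Here `Y` holds one row of embedding coordinates per sample and `W` the reconstruction weights. If the neighbor graph splits into disconnected groups, the zero eigenvalue is repeated and the leading coordinates mainly encode group membership, so the coordinates should be read with that in mind.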