Manifold Learning for Human Population Structure Studies

The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the “intrinsic dimensionality” of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE) which proj...

Bibliographic Details
Main authors: Siu, Hoicheong; Jin, Li; Xiong, Momiao
Format: Online Article Text
Language: English
Published: Public Library of Science 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3260176/
https://www.ncbi.nlm.nih.gov/pubmed/22272259
http://dx.doi.org/10.1371/journal.pone.0029901
_version_ 1782221451670061056
author Siu, Hoicheong
Jin, Li
Xiong, Momiao
author_facet Siu, Hoicheong
Jin, Li
Xiong, Momiao
author_sort Siu, Hoicheong
collection PubMed
description The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the “intrinsic dimensionality” of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE), which projects high-dimensional genomic data into a low-dimensional, neighborhood-preserving embedding, as a general framework for population structure and historical inference. To facilitate application of the LLE to population genetic analysis, we systematically investigate several important properties of the LLE and reveal the connection between the LLE and principal component analysis (PCA). Identifying a set of markers and genomic regions which could be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying the LLE-correlated or PCA-correlated structure-informative markers, we have developed a new statistic that integrates genomic information content in a genomic region for collectively studying its association with the population structure, and a LASSO algorithm to search for such regions across the genomes. We applied the developed methodologies to a low coverage pilot dataset in the 1000 Genomes Project and a PHASE III Mexico dataset of the HapMap. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were the LLE-correlated markers in CEU, YRI and ASI, respectively. This showed that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants. The preliminary results demonstrated that next-generation sequencing offers a rich resource and that LLE provides a powerful tool for population structure analysis.
format Online
Article
Text
id pubmed-3260176
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3260176 2012-01-23 Manifold Learning for Human Population Structure Studies Siu, Hoicheong Jin, Li Xiong, Momiao PLoS One Research Article The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the “intrinsic dimensionality” of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE), which projects high-dimensional genomic data into a low-dimensional, neighborhood-preserving embedding, as a general framework for population structure and historical inference. To facilitate application of the LLE to population genetic analysis, we systematically investigate several important properties of the LLE and reveal the connection between the LLE and principal component analysis (PCA). Identifying a set of markers and genomic regions which could be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying the LLE-correlated or PCA-correlated structure-informative markers, we have developed a new statistic that integrates genomic information content in a genomic region for collectively studying its association with the population structure, and a LASSO algorithm to search for such regions across the genomes. We applied the developed methodologies to a low coverage pilot dataset in the 1000 Genomes Project and a PHASE III Mexico dataset of the HapMap. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were the LLE-correlated markers in CEU, YRI and ASI, respectively. This showed that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants. The preliminary results demonstrated that next-generation sequencing offers a rich resource and that LLE provides a powerful tool for population structure analysis. Public Library of Science 2012-01-17 /pmc/articles/PMC3260176/ /pubmed/22272259 http://dx.doi.org/10.1371/journal.pone.0029901 Text en Siu et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Siu, Hoicheong
Jin, Li
Xiong, Momiao
Manifold Learning for Human Population Structure Studies
title Manifold Learning for Human Population Structure Studies
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3260176/
https://www.ncbi.nlm.nih.gov/pubmed/22272259
http://dx.doi.org/10.1371/journal.pone.0029901