
Probabilistic ancestry maps: a method to assess and visualize population substructures in genetics

BACKGROUND: Principal component analysis (PCA) is a standard method to correct for population stratification in ancestry-specific genome-wide association studies (GWASs) and is used to cluster individuals by ancestry. Using the 1000 genomes project data, we examine how non-linear dimensionality redu...


Bibliographic Details
Main Authors:	Gaspar, Héléna A., Breen, Gerome
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6407257/ https://www.ncbi.nlm.nih.gov/pubmed/30845922 http://dx.doi.org/10.1186/s12859-019-2680-1

author	Gaspar, Héléna A.; Breen, Gerome
collection	PubMed
description	BACKGROUND: Principal component analysis (PCA) is a standard method to correct for population stratification in ancestry-specific genome-wide association studies (GWASs) and is used to cluster individuals by ancestry. Using the 1000 genomes project data, we examine how non-linear dimensionality reduction methods such as t-distributed stochastic neighbor embedding (t-SNE) or generative topographic mapping (GTM) can be used to provide improved ancestry maps by accounting for a higher percentage of explained variance in ancestry, and how they can help to estimate the number of principal components necessary to account for population stratification. GTM generates posterior probabilities of class membership which can be used to assess the probability of an individual to belong to a given population - as opposed to t-SNE, GTM can be used for both clustering and classification. RESULTS: PCA only partially identifies population clusters and does not separate most populations within a given continent, such as Japanese and Han Chinese in East Asia, or Mende and Yoruba in Africa. t-SNE and GTM, taking into account more data variance, can identify more fine-grained population clusters. GTM can be used to build probabilistic classification models, and is as efficient as support vector machine (SVM) for classifying 1000 Genomes Project populations. CONCLUSION: The main interest of probabilistic GTM maps is to attain two objectives with only one map: provide a better visualization that separates populations efficiently, and infer genetic ancestry for individuals or populations. This paper is a first application of GTM for ancestry classification models. Our code (https://github.com/hagax8/ancestry_viz) and interactive visualizations (https://lovingscience.com/ancestries) are available online. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-2680-1) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6407257
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6407257 2019-03-21 Probabilistic ancestry maps: a method to assess and visualize population substructures in genetics Gaspar, Héléna A. Breen, Gerome BMC Bioinformatics Methodology Article BioMed Central 2019-03-07 /pmc/articles/PMC6407257/ /pubmed/30845922 http://dx.doi.org/10.1186/s12859-019-2680-1 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Probabilistic ancestry maps: a method to assess and visualize population substructures in genetics
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6407257/ https://www.ncbi.nlm.nih.gov/pubmed/30845922 http://dx.doi.org/10.1186/s12859-019-2680-1
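
The abstract states that GTM yields posterior probabilities of class membership for each individual. A minimal sketch of one common way to obtain such probabilities is shown below, assuming a fitted map already provides per-individual node responsibilities. The function names, the toy responsibility values, and the choice of class priors proportional to training class sizes are illustrative assumptions, not taken from the authors' code; JPT and CHB are the 1000 Genomes Project codes for the Japanese and Han Chinese samples.

```python
from collections import defaultdict


def node_class_probabilities(responsibilities, labels):
    """Estimate P(class | node) from labelled training individuals.

    responsibilities[n][k] is the responsibility of map node k for
    individual n (each row sums to 1); labels[n] is the population
    label of individual n. Class priors are implicitly proportional
    to class sizes in the training set (an assumption of this sketch).
    """
    n_nodes = len(responsibilities[0])
    # Accumulate the responsibility mass of each class at each node.
    mass = defaultdict(lambda: [0.0] * n_nodes)
    for row, label in zip(responsibilities, labels):
        for k, r in enumerate(row):
            mass[label][k] += r
    classes = sorted(mass)
    probs = []
    for k in range(n_nodes):
        total = sum(mass[c][k] for c in classes)
        if total > 0:
            probs.append({c: mass[c][k] / total for c in classes})
        else:
            # Empty node: fall back to a uniform distribution.
            probs.append({c: 1.0 / len(classes) for c in classes})
    return probs


def class_posterior(row, node_probs):
    """P(class | individual) = sum over nodes of P(class | node) * responsibility."""
    classes = node_probs[0].keys()
    return {c: sum(p[c] * r for p, r in zip(node_probs, row)) for c in classes}


# Hypothetical responsibilities on a 3-node map (not real data).
train_resp = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.2, 0.8],
    [0.1, 0.1, 0.8],
]
train_labels = ["JPT", "JPT", "CHB", "CHB"]

node_probs = node_class_probabilities(train_resp, train_labels)
posterior = class_posterior([0.7, 0.3, 0.0], node_probs)
print(posterior)
```

The same node probabilities can colour the map for visualization and classify new individuals, which is the "two objectives with only one map" idea the abstract describes.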

