GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis

Inferring subject ancestry using genetic data is an important step in genetic association studies, required for dealing with population stratification. It has become more challenging to infer subject ancestry quickly and accurately since large amounts of genotype data, collected from millions of sub...

Bibliographic Details
Main authors: Jin, Yumi, Schaffer, Alejandro A., Feolo, Michael, Holmes, J. Bradley, Kattman, Brandi L.
Format: Online Article Text
Language: English
Published: Genetics Society of America 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6686921/
https://www.ncbi.nlm.nih.gov/pubmed/31151998
http://dx.doi.org/10.1534/g3.118.200925
author Jin, Yumi
Schaffer, Alejandro A.
Feolo, Michael
Holmes, J. Bradley
Kattman, Brandi L.
collection PubMed
description Inferring subject ancestry using genetic data is an important step in genetic association studies, required for dealing with population stratification. It has become more challenging to infer subject ancestry quickly and accurately since large amounts of genotype data, collected from millions of subjects by thousands of studies using different methods, are accessible to researchers from repositories such as the database of Genotypes and Phenotypes (dbGaP) at the National Center for Biotechnology Information (NCBI). Study-reported populations submitted to dbGaP are often not harmonized across studies or may be missing. Widely used methods for ancestry prediction assume that most markers are genotyped in all subjects, but this assumption is unrealistic if one wants to combine studies that used different genotyping platforms. To provide ancestry inference and visualization across studies, we developed a new method, GRAF-pop, of ancestry prediction that is robust to missing genotypes and allows researchers to visualize predicted population structure in color and in three dimensions. When genotypes are dense, GRAF-pop is comparable in quality and running time to existing ancestry inference methods EIGENSTRAT, FastPCA, and FlashPCA2, all of which rely on principal components analysis (PCA). When genotypes are not dense, GRAF-pop gives much better ancestry predictions than the PCA-based methods. GRAF-pop employs basic geometric and probabilistic methods; the visualized ancestry predictions have a natural geometric interpretation, which is lacking in PCA-based methods. Since February 2018, GRAF-pop has been successfully incorporated into the dbGaP quality control process to identify inconsistencies between study-reported and computationally predicted populations and to provide harmonized population values in all new dbGaP submissions amenable to population prediction, based on marker genotypes. Plots, produced by GRAF-pop, of summary population predictions are available on dbGaP study pages, and the software is available at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/Software.cgi.
format Online
Article
Text
id pubmed-6686921
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Genetics Society of America
record_format MEDLINE/PubMed
spelling pubmed-6686921 2019-08-11
G3 (Bethesda)
Investigations
Genetics Society of America 2019-05-31
/pmc/articles/PMC6686921/ /pubmed/31151998 http://dx.doi.org/10.1534/g3.118.200925
Text en
Copyright © 2019 Jin et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis
topic Investigations
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6686921/
https://www.ncbi.nlm.nih.gov/pubmed/31151998
http://dx.doi.org/10.1534/g3.118.200925