GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis
Inferring subject ancestry using genetic data is an important step in genetic association studies, required for dealing with population stratification. It has become more challenging to infer subject ancestry quickly and accurately since large amounts of genotype data, collected from millions of sub...
Main authors: | Jin, Yumi; Schaffer, Alejandro A.; Feolo, Michael; Holmes, J. Bradley; Kattman, Brandi L. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Genetics Society of America, 2019 |
Subjects: | Investigations |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6686921/ https://www.ncbi.nlm.nih.gov/pubmed/31151998 http://dx.doi.org/10.1534/g3.118.200925 |
_version_ | 1783442641293148160 |
---|---|
author | Jin, Yumi; Schaffer, Alejandro A.; Feolo, Michael; Holmes, J. Bradley; Kattman, Brandi L. |
author_facet | Jin, Yumi; Schaffer, Alejandro A.; Feolo, Michael; Holmes, J. Bradley; Kattman, Brandi L. |
author_sort | Jin, Yumi |
collection | PubMed |
description | Inferring subject ancestry using genetic data is an important step in genetic association studies, required for dealing with population stratification. It has become more challenging to infer subject ancestry quickly and accurately since large amounts of genotype data, collected from millions of subjects by thousands of studies using different methods, are accessible to researchers from repositories such as the database of Genotypes and Phenotypes (dbGaP) at the National Center for Biotechnology Information (NCBI). Study-reported populations submitted to dbGaP are often not harmonized across studies or may be missing. Widely used methods for ancestry prediction assume that most markers are genotyped in all subjects, but this assumption is unrealistic if one wants to combine studies that used different genotyping platforms. To provide ancestry inference and visualization across studies, we developed a new method, GRAF-pop, of ancestry prediction that is robust to missing genotypes and allows researchers to visualize predicted population structure in color and in three dimensions. When genotypes are dense, GRAF-pop is comparable in quality and running time to existing ancestry inference methods EIGENSTRAT, FastPCA, and FlashPCA2, all of which rely on principal components analysis (PCA). When genotypes are not dense, GRAF-pop gives much better ancestry predictions than the PCA-based methods. GRAF-pop employs basic geometric and probabilistic methods; the visualized ancestry predictions have a natural geometric interpretation, which is lacking in PCA-based methods. Since February 2018, GRAF-pop has been successfully incorporated into the dbGaP quality control process to identify inconsistencies between study-reported and computationally predicted populations and to provide harmonized population values in all new dbGaP submissions amenable to population prediction, based on marker genotypes. Plots of summary population predictions produced by GRAF-pop are available on dbGaP study pages, and the software is available at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/Software.cgi. |
format | Online Article Text |
id | pubmed-6686921 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Genetics Society of America |
record_format | MEDLINE/PubMed |
spelling | pubmed-6686921 2019-08-11 GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis Jin, Yumi Schaffer, Alejandro A. Feolo, Michael Holmes, J. Bradley Kattman, Brandi L. G3 (Bethesda) Investigations Inferring subject ancestry using genetic data is an important step in genetic association studies, required for dealing with population stratification. It has become more challenging to infer subject ancestry quickly and accurately since large amounts of genotype data, collected from millions of subjects by thousands of studies using different methods, are accessible to researchers from repositories such as the database of Genotypes and Phenotypes (dbGaP) at the National Center for Biotechnology Information (NCBI). Study-reported populations submitted to dbGaP are often not harmonized across studies or may be missing. Widely used methods for ancestry prediction assume that most markers are genotyped in all subjects, but this assumption is unrealistic if one wants to combine studies that used different genotyping platforms. To provide ancestry inference and visualization across studies, we developed a new method, GRAF-pop, of ancestry prediction that is robust to missing genotypes and allows researchers to visualize predicted population structure in color and in three dimensions. When genotypes are dense, GRAF-pop is comparable in quality and running time to existing ancestry inference methods EIGENSTRAT, FastPCA, and FlashPCA2, all of which rely on principal components analysis (PCA). When genotypes are not dense, GRAF-pop gives much better ancestry predictions than the PCA-based methods. GRAF-pop employs basic geometric and probabilistic methods; the visualized ancestry predictions have a natural geometric interpretation, which is lacking in PCA-based methods. Since February 2018, GRAF-pop has been successfully incorporated into the dbGaP quality control process to identify inconsistencies between study-reported and computationally predicted populations and to provide harmonized population values in all new dbGaP submissions amenable to population prediction, based on marker genotypes. Plots of summary population predictions produced by GRAF-pop are available on dbGaP study pages, and the software is available at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/Software.cgi. Genetics Society of America 2019-05-31 /pmc/articles/PMC6686921/ /pubmed/31151998 http://dx.doi.org/10.1534/g3.118.200925 Text en Copyright © 2019 Jin et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Investigations Jin, Yumi Schaffer, Alejandro A. Feolo, Michael Holmes, J. Bradley Kattman, Brandi L. GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis |
title | GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis |
title_full | GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis |
title_fullStr | GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis |
title_full_unstemmed | GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis |
title_short | GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis |
title_sort | graf-pop: a fast distance-based method to infer subject ancestry from multiple genotype datasets without principal components analysis |
topic | Investigations |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6686921/ https://www.ncbi.nlm.nih.gov/pubmed/31151998 http://dx.doi.org/10.1534/g3.118.200925 |
work_keys_str_mv | AT jinyumi grafpopafastdistancebasedmethodtoinfersubjectancestryfrommultiplegenotypedatasetswithoutprincipalcomponentsanalysis AT schafferalejandroa grafpopafastdistancebasedmethodtoinfersubjectancestryfrommultiplegenotypedatasetswithoutprincipalcomponentsanalysis AT feolomichael grafpopafastdistancebasedmethodtoinfersubjectancestryfrommultiplegenotypedatasetswithoutprincipalcomponentsanalysis AT holmesjbradley grafpopafastdistancebasedmethodtoinfersubjectancestryfrommultiplegenotypedatasetswithoutprincipalcomponentsanalysis AT kattmanbrandil grafpopafastdistancebasedmethodtoinfersubjectancestryfrommultiplegenotypedatasetswithoutprincipalcomponentsanalysis |