
A robust clustering algorithm for identifying problematic samples in genome-wide association studies

Summary: High-throughput genotyping arrays provide an efficient way to survey single nucleotide polymorphisms (SNPs) across the genome in large numbers of individuals. Downstream analysis of the data, for example in genome-wide association studies (GWAS), often involves statistical models of genotyp...

Bibliographic Details
Main authors: Bellenguez, Céline; Strange, Amy; Freeman, Colin; Donnelly, Peter; Spencer, Chris C.A.
Format: Online Article Text
Language: English
Published: Oxford University Press 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3244763/
https://www.ncbi.nlm.nih.gov/pubmed/22057162
http://dx.doi.org/10.1093/bioinformatics/btr599
author Bellenguez, Céline
Strange, Amy
Freeman, Colin
Donnelly, Peter
Spencer, Chris C.A.
collection PubMed
description Summary: High-throughput genotyping arrays provide an efficient way to survey single nucleotide polymorphisms (SNPs) across the genome in large numbers of individuals. Downstream analysis of the data, for example in genome-wide association studies (GWAS), often involves statistical models of genotype frequencies across individuals. The complexities of the sample collection process and the potential for errors in the experimental assay can lead to biases and artefacts in an individual's inferred genotypes. Rather than attempting to model these complications, it has become a standard practice to remove individuals whose genome-wide data differ from the sample at large. Here we describe a simple, but robust, statistical algorithm to identify samples with atypical summaries of genome-wide variation. Its use as a semi-automated quality control tool is demonstrated using several summary statistics, selected to identify different potential problems, and it is applied to two different genotyping platforms and sample collections. Availability: The algorithm is written in R and is freely available at www.well.ox.ac.uk/chris-spencer Contact: chris.spencer@well.ox.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-3244763
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3244763 2011-12-22 A robust clustering algorithm for identifying problematic samples in genome-wide association studies Bellenguez, Céline Strange, Amy Freeman, Colin Donnelly, Peter Spencer, Chris C.A. Bioinformatics Applications Note Oxford University Press 2012-01-01 2011-11-03 /pmc/articles/PMC3244763/ /pubmed/22057162 http://dx.doi.org/10.1093/bioinformatics/btr599 Text en © The Author(s) 2011. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Applications Note