
Controlling for population structure and genotyping platform bias in the eMERGE multi-institutional biobank linked to electronic health records

Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification i...


Bibliographic Details
Main authors: Crosslin, David R., Tromp, Gerard, Burt, Amber, Kim, Daniel S., Verma, Shefali S., Lucas, Anastasia M., Bradford, Yuki, Crawford, Dana C., Armasu, Sebastian M., Heit, John A., Hayes, M. Geoffrey, Kuivaniemi, Helena, Ritchie, Marylyn D., Jarvik, Gail P., de Andrade, Mariza
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2014
Subjects: Genetics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4220165/
https://www.ncbi.nlm.nih.gov/pubmed/25414722
http://dx.doi.org/10.3389/fgene.2014.00352
author Crosslin, David R.
Tromp, Gerard
Burt, Amber
Kim, Daniel S.
Verma, Shefali S.
Lucas, Anastasia M.
Bradford, Yuki
Crawford, Dana C.
Armasu, Sebastian M.
Heit, John A.
Hayes, M. Geoffrey
Kuivaniemi, Helena
Ritchie, Marylyn D.
Jarvik, Gail P.
de Andrade, Mariza
author_sort Crosslin, David R.
collection PubMed
description Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification is a common practice. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of a multi-ethnic cohort with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA by deriving components from subject loadings calculated from a reference sample. This alternative approach of generating principal components controlled for site and platform bias, in addition to ancestry differences, has the advantage of fewer covariates and degrees of freedom.
format Online
Article
Text
id pubmed-4220165
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-4220165 2014-11-20 Front Genet Genetics Frontiers Media S.A. 2014-11-04 /pmc/articles/PMC4220165/ /pubmed/25414722 http://dx.doi.org/10.3389/fgene.2014.00352 Text en Copyright © 2014 Crosslin, Tromp, Burt, Kim, Verma, Lucas, Bradford, Crawford, Armasu, Heit, Hayes, Kuivaniemi, Ritchie, Jarvik and de Andrade. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Controlling for population structure and genotyping platform bias in the eMERGE multi-institutional biobank linked to electronic health records
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4220165/
https://www.ncbi.nlm.nih.gov/pubmed/25414722
http://dx.doi.org/10.3389/fgene.2014.00352
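
The description above mentions deriving components from loadings calculated on a reference sample rather than from the combined study samples. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes genotypes coded as 0/1/2 allele counts with no missing values, uses simulated data, and the function names `reference_loadings` and `project` are hypothetical.

```python
import numpy as np

def reference_loadings(ref_genotypes, n_components=2):
    """Return per-variant mean, std, and loadings computed on the reference panel only.

    ref_genotypes: array of shape (subjects, variants), assumed 0/1/2 coding.
    """
    ref = np.asarray(ref_genotypes, dtype=float)
    mean = ref.mean(axis=0)
    std = ref.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for monomorphic variants
    z = (ref - mean) / std
    # Rows of vt are the principal axes in variant space.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return mean, std, vt[:n_components].T

def project(genotypes, mean, std, loadings):
    """Standardize with the reference statistics and project onto the reference axes."""
    z = (np.asarray(genotypes, dtype=float) - mean) / std
    return z @ loadings

# Simulated data for illustration only (assumption: same variants in both sets).
rng = np.random.default_rng(0)
ref = rng.integers(0, 3, size=(50, 200))    # reference panel
study = rng.integers(0, 3, size=(10, 200))  # study subjects

mean, std, loadings = reference_loadings(ref, n_components=2)
ref_pcs = project(ref, mean, std, loadings)
study_pcs = project(study, mean, std, loadings)
print(study_pcs.shape)
```

Because the axes are fixed by the reference panel, study subjects from different sites or platforms are scored on the same components, which is the property the description highlights; variant selection and quality control would still be needed in practice.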