
Imputation and quality control steps for combining multiple genome-wide datasets

The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the ne...


Bibliographic Details
Main Authors: Verma, Shefali S., de Andrade, Mariza, Tromp, Gerard, Kuivaniemi, Helena, Pugh, Elizabeth, Namjou-Khales, Bahram, Mukherjee, Shubhabrata, Jarvik, Gail P., Kottyan, Leah C., Burt, Amber, Bradford, Yuki, Armstrong, Gretta D., Derr, Kimberly, Crawford, Dana C., Haines, Jonathan L., Li, Rongling, Crosslin, David, Ritchie, Marylyn D.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2014
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4263197/
https://www.ncbi.nlm.nih.gov/pubmed/25566314
http://dx.doi.org/10.3389/fgene.2014.00370
author Verma, Shefali S.
de Andrade, Mariza
Tromp, Gerard
Kuivaniemi, Helena
Pugh, Elizabeth
Namjou-Khales, Bahram
Mukherjee, Shubhabrata
Jarvik, Gail P.
Kottyan, Leah C.
Burt, Amber
Bradford, Yuki
Armstrong, Gretta D.
Derr, Kimberly
Crawford, Dana C.
Haines, Jonathan L.
Li, Rongling
Crosslin, David
Ritchie, Marylyn D.
collection PubMed
description The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R² (estimated correlation between the imputed and true genotypes), and the relationship between allelic R² and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
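The evaluation metrics named in the description can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the record: that a subset of genotypes was masked before imputation so the true values are known, that genotypes are coded as alternate-allele counts (0, 1, 2), and that imputed values are dosages between 0 and 2. The function names are hypothetical, and the squared correlation computed here is an empirical stand-in, not the allelic R² estimator that BEAGLE reports or the pipeline the authors implemented.

```python
from statistics import mean


def minor_allele_frequency(genotypes):
    """Minor allele frequency from per-sample alternate-allele counts (0, 1, or 2)."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1.0 - alt_freq)


def squared_correlation(imputed_dosages, true_genotypes):
    """Squared Pearson correlation between imputed dosages and known genotypes."""
    mean_x = mean(imputed_dosages)
    mean_y = mean(true_genotypes)
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(imputed_dosages, true_genotypes))
    var_x = sum((x - mean_x) ** 2 for x in imputed_dosages)
    var_y = sum((y - mean_y) ** 2 for y in true_genotypes)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a monomorphic marker.
        return float("nan")
    return cov * cov / (var_x * var_y)


def concordance(imputed_dosages, true_genotypes):
    """Fraction of samples whose rounded dosage equals the known genotype."""
    calls = [round(d) for d in imputed_dosages]
    matches = sum(1 for c, g in zip(calls, true_genotypes) if c == g)
    return matches / len(true_genotypes)


# Illustrative values for one marker; these are made-up numbers, not eMERGE data.
true_genotypes = [0, 1, 2, 1, 0, 0]
imputed_dosages = [0.1, 0.9, 1.8, 1.4, 0.0, 0.6]

print("MAF:", minor_allele_frequency(true_genotypes))
print("r^2:", squared_correlation(imputed_dosages, true_genotypes))
print("concordance:", concordance(imputed_dosages, true_genotypes))
```

Computing these three values per marker and plotting the squared correlation against minor allele frequency is one way to examine the relationship between the two that the description mentions.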
format Online
Article
Text
id pubmed-4263197
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-4263197 2015-01-06 Front Genet Genetics Frontiers Media S.A. 2014-12-11 /pmc/articles/PMC4263197/ /pubmed/25566314 http://dx.doi.org/10.3389/fgene.2014.00370 Text en
Copyright © 2014 Verma, de Andrade, Tromp, Kuivaniemi, Pugh, Namjou-Khales, Mukherjee, Jarvik, Kottyan, Burt, Bradford, Armstrong, Derr, Crawford, Haines, Li, Crosslin and Ritchie. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Imputation and quality control steps for combining multiple genome-wide datasets
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4263197/
https://www.ncbi.nlm.nih.gov/pubmed/25566314
http://dx.doi.org/10.3389/fgene.2014.00370