Imputation and quality control steps for combining multiple genome-wide datasets
The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the ne...
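Main authors: | Verma, Shefali S.; de Andrade, Mariza; Tromp, Gerard; Kuivaniemi, Helena; Pugh, Elizabeth; Namjou-Khales, Bahram; Mukherjee, Shubhabrata; Jarvik, Gail P.; Kottyan, Leah C.; Burt, Amber; Bradford, Yuki; Armstrong, Gretta D.; Derr, Kimberly; Crawford, Dana C.; Haines, Jonathan L.; Li, Rongling; Crosslin, David; Ritchie, Marylyn D. |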
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2014 |
Subjects: | Genetics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4263197/ https://www.ncbi.nlm.nih.gov/pubmed/25566314 http://dx.doi.org/10.3389/fgene.2014.00370 |
_version_ | 1782348530633932800 |
---|---|
author | Verma, Shefali S. de Andrade, Mariza Tromp, Gerard Kuivaniemi, Helena Pugh, Elizabeth Namjou-Khales, Bahram Mukherjee, Shubhabrata Jarvik, Gail P. Kottyan, Leah C. Burt, Amber Bradford, Yuki Armstrong, Gretta D. Derr, Kimberly Crawford, Dana C. Haines, Jonathan L. Li, Rongling Crosslin, David Ritchie, Marylyn D. |
author_sort | Verma, Shefali S. |
collection | PubMed |
description | The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R² (estimated correlation between the imputed and true genotypes), and the relationship between allelic R² and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR. |
format | Online Article Text |
id | pubmed-4263197 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-42631972015-01-06 Imputation and quality control steps for combining multiple genome-wide datasets Verma, Shefali S. de Andrade, Mariza Tromp, Gerard Kuivaniemi, Helena Pugh, Elizabeth Namjou-Khales, Bahram Mukherjee, Shubhabrata Jarvik, Gail P. Kottyan, Leah C. Burt, Amber Bradford, Yuki Armstrong, Gretta D. Derr, Kimberly Crawford, Dana C. Haines, Jonathan L. Li, Rongling Crosslin, David Ritchie, Marylyn D. Front Genet Genetics The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R(2) (estimated correlation between the imputed and true genotypes), and the relationship between allelic R(2) and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR. Frontiers Media S.A. 
2014-12-11 /pmc/articles/PMC4263197/ /pubmed/25566314 http://dx.doi.org/10.3389/fgene.2014.00370 Text en Copyright © 2014 Verma, de Andrade, Tromp, Kuivaniemi, Pugh, Namjou-Khales, Mukherjee, Jarvik, Kottyan, Burt, Bradford, Armstrong, Derr, Crawford, Haines, Li, Crosslin and Ritchie. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
title | Imputation and quality control steps for combining multiple genome-wide datasets |
topic | Genetics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4263197/ https://www.ncbi.nlm.nih.gov/pubmed/25566314 http://dx.doi.org/10.3389/fgene.2014.00370 |