
A ν-support vector regression based approach for predicting imputation quality

BACKGROUND: Decades of genome-wide association studies (GWAS) have accumulated large volumes of genomic data that can potentially be reused to increase statistical power of new studies, but different genotyping platforms with different marker sets have been used as biotechnology has evolved, prevent...


Bibliographic Details
Main Authors:	Huang, Yi-Hung; Rice, John P; Saccone, Scott F; Ambite, José Luis; Arens, Yigal; Tischfield, Jay A; Hsu, Chun-Nan
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2012
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3504919/ https://www.ncbi.nlm.nih.gov/pubmed/23173775 http://dx.doi.org/10.1186/1753-6561-6-S7-S3

author	Huang, Yi-Hung; Rice, John P; Saccone, Scott F; Ambite, José Luis; Arens, Yigal; Tischfield, Jay A; Hsu, Chun-Nan
author_sort	Huang, Yi-Hung
collection	PubMed
description	BACKGROUND: Decades of genome-wide association studies (GWAS) have accumulated large volumes of genomic data that could be reused to increase the statistical power of new studies. However, genotyping platforms with different marker sets have been used as biotechnology has evolved, which prevents pooling and comparison of old and new data. For example, to pool data collected with 550K chips with newer data collected with 900K chips, the missing loci must be imputed. Many imputation algorithms have been developed, but the posterior probabilities they estimate are not a reliable measure of imputation quality. Recently, many studies have used an imputation quality score (IQS) to measure the quality of imputation. Estimating the IQS requires knowing the true alleles. When the true alleles are unknown, a previously estimated IQS can be reused only if the population and the imputed loci are identical. METHODS: Here, we present a regression model that estimates IQS by learning from the imputation of loci with known alleles. We designed a small set of features, such as minor allele frequencies, distance to the nearest known cross-over hotspot, etc., for the prediction of IQS. We evaluated our regression models by estimating the IQS of imputations by BEAGLE for a set of GWAS data from the NCBI GEO database, collected from samples of different ethnic populations. RESULTS: We constructed a ν-SVR-based approach as our regression model. Our evaluation shows that this regression model can achieve mean squared errors of less than 0.02 and a correlation coefficient close to 0.75 in different imputation scenarios. We also show how the regression results can help remove false positives in association studies. CONCLUSION: Reliable estimation of IQS will facilitate integration and reuse of existing genomic data for meta-analysis and secondary analysis. Experiments show that it is possible to use a small number of features to regress the IQS by learning from different training examples of imputation and IQS pairs.
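The abstract reports two evaluation figures for the regression model: a mean squared error and a correlation coefficient between the estimated and the true IQS. The following is a minimal sketch of how such figures could be computed. It assumes the correlation coefficient is Pearson's r, which the record does not state; the function names and the sample values are hypothetical and are not taken from the article.

```python
import math


def mean_squared_error(true_iqs, predicted_iqs):
    """Average squared difference between true and predicted IQS values."""
    if len(true_iqs) != len(predicted_iqs) or not true_iqs:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_iqs, predicted_iqs)]
    return sum(squared_errors) / len(squared_errors)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient (assumed here; the record only says
    'correlation coefficient')."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-locus values, for illustration only (not data from the article).
true_iqs = [0.95, 0.40, 0.80, 0.10, 0.65]
predicted_iqs = [0.90, 0.55, 0.70, 0.20, 0.60]

mse = mean_squared_error(true_iqs, predicted_iqs)
r = pearson_correlation(true_iqs, predicted_iqs)
print(f"MSE = {mse:.4f}, r = {r:.3f}")
```

In this reading, each list entry corresponds to one imputed locus whose true alleles are known, so the true IQS can be computed and compared with the model's estimate.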
format	Online Article Text
id	pubmed-3504919
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3504919 2012-11-29 BMC Proc Proceedings BioMed Central 2012-11-13 /pmc/articles/PMC3504919/ /pubmed/23173775 http://dx.doi.org/10.1186/1753-6561-6-S7-S3 Text en Copyright ©2012 Huang et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	A ν-support vector regression based approach for predicting imputation quality
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3504919/ https://www.ncbi.nlm.nih.gov/pubmed/23173775 http://dx.doi.org/10.1186/1753-6561-6-S7-S3
