PIP-SNP: a pipeline for processing SNP data featured as linkage disequilibrium bin mapping, genotype imputing and marker synthesizing

Genome-wide association study data analyses often face two significant challenges: (i) high dimensionality of single-nucleotide polymorphism (SNP) genotypes and (ii) imputation of missing values. SNPs are not independent due to physical linkage and natural selection. The correlation of nearby SNPs i...

Bibliographic Details
Main Authors:	Zhang, Wenchao; Kang, Yun; Dai, Xinbin; Xu, Shizhong; Zhao, Patrick X
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2021
Subjects:	Standard Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8256826/ https://www.ncbi.nlm.nih.gov/pubmed/34235432 http://dx.doi.org/10.1093/nargab/lqab060

collection	PubMed
description	Genome-wide association study data analyses often face two significant challenges: (i) high dimensionality of single-nucleotide polymorphism (SNP) genotypes and (ii) imputation of missing values. SNPs are not independent due to physical linkage and natural selection. The correlation of nearby SNPs is known as linkage disequilibrium (LD), which can be used for LD conceptual SNP bin mapping, missing genotype inferencing and SNP dimension reduction. We used a stochastic process to describe the SNP signals and proposed two types of autocorrelations to measure nearby SNPs’ information redundancy. Based on the calculated autocorrelation coefficients, we constructed LD bins. We adopted a k-nearest neighbors algorithm (kNN) to impute the missing genotypes. We proposed several novel methods to find the optimal synthetic marker to represent the SNP bin. We also proposed methods to evaluate the information loss or information conservation between using the original genome-wide markers and using dimension-reduced synthetic markers. Our performance assessments on the real-life SNP data from a rice recombinant inbred line (RIL) population and a rice HapMap project show that the new methods produce satisfactory results. We implemented these functional modules in C/C++ and streamlined them into a web-based pipeline named PIP-SNP (https://bioinfo.noble.org/PIP_SNP/) for processing SNP data.
format	Online Article Text
id	pubmed-8256826
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-8256826 2021-07-06 NAR Genom Bioinform Standard Article Oxford University Press 2021-07-05 /pmc/articles/PMC8256826/ /pubmed/34235432 http://dx.doi.org/10.1093/nargab/lqab060 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
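
The description above mentions a k-nearest neighbors (kNN) step for imputing missing genotypes but does not give its details. The following is a minimal Python sketch of the general technique only, not the PIP-SNP implementation (which the record says is written in C/C++). The function name `knn_impute`, the row-per-individual layout, the 0/1/2 genotype coding, the mismatch-fraction distance, and the majority vote are all assumptions made for illustration.

```python
from collections import Counter


def knn_impute(genotypes, k=3):
    """Fill None entries by majority vote among the k most similar rows.

    genotypes: list of rows (one per individual), each a list of genotype
    codes (for example 0/1/2) or None for a missing call. This layout is
    hypothetical; the source record does not specify one.
    """

    def distance(a, b):
        # Fraction of mismatching calls over SNPs observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sum(x != y for x, y in shared) / len(shared)

    imputed = [row[:] for row in genotypes]  # leave the input untouched
    for i, row in enumerate(genotypes):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows with an observed call at SNP j,
            # ordered from most to least similar.
            candidates = sorted(
                (distance(row, other), idx)
                for idx, other in enumerate(genotypes)
                if idx != i and other[j] is not None
            )
            votes = Counter(genotypes[idx][j] for _, idx in candidates[:k])
            if votes:
                # Ties go to the genotype seen first among the nearest rows.
                imputed[i][j] = votes.most_common(1)[0][0]
    return imputed
```

For example, calling `knn_impute([[0, 0, 1, None], [0, 0, 1, 2], [0, 0, 1, 2], [2, 2, 0, 0]], k=2)` fills the missing call in the first row with `2`, because the two most similar rows both carry that genotype. Distances are computed on the original observed calls rather than on previously imputed ones, which keeps the result independent of the order in which missing entries are visited.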
