PIP-SNP: a pipeline for processing SNP data featured as linkage disequilibrium bin mapping, genotype imputing and marker synthesizing

Genome-wide association study data analyses often face two significant challenges: (i) high dimensionality of single-nucleotide polymorphism (SNP) genotypes and (ii) imputation of missing values. SNPs are not independent due to physical linkage and natural selection. The correlation of nearby SNPs i...

Bibliographic Details
Main Authors: Zhang, Wenchao; Kang, Yun; Dai, Xinbin; Xu, Shizhong; Zhao, Patrick X
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8256826/
https://www.ncbi.nlm.nih.gov/pubmed/34235432
http://dx.doi.org/10.1093/nargab/lqab060
author Zhang, Wenchao
Kang, Yun
Dai, Xinbin
Xu, Shizhong
Zhao, Patrick X
author_facet Zhang, Wenchao
Kang, Yun
Dai, Xinbin
Xu, Shizhong
Zhao, Patrick X
author_sort Zhang, Wenchao
collection PubMed
description Genome-wide association study data analyses often face two significant challenges: (i) high dimensionality of single-nucleotide polymorphism (SNP) genotypes and (ii) imputation of missing values. SNPs are not independent due to physical linkage and natural selection. The correlation of nearby SNPs is known as linkage disequilibrium (LD), which can be used for LD conceptual SNP bin mapping, missing genotype inferencing and SNP dimension reduction. We used a stochastic process to describe the SNP signals and proposed two types of autocorrelations to measure nearby SNPs’ information redundancy. Based on the calculated autocorrelation coefficients, we constructed LD bins. We adopted a k-nearest neighbors algorithm (kNN) to impute the missing genotypes. We proposed several novel methods to find the optimal synthetic marker to represent the SNP bin. We also proposed methods to evaluate the information loss or information conservation between using the original genome-wide markers and using dimension-reduced synthetic markers. Our performance assessments on the real-life SNP data from a rice recombinant inbred line (RIL) population and a rice HapMap project show that the new methods produce satisfactory results. We implemented these functional modules in C/C++ and streamlined them into a web-based pipeline named PIP-SNP (https://bioinfo.noble.org/PIP_SNP/) for processing SNP data.
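The workflow summarized in the description (adjacent-SNP correlation, LD bin construction, kNN imputation, one synthetic marker per bin) can be illustrated with a minimal Python sketch. This is not the PIP-SNP C/C++ implementation. The record does not define the two autocorrelation types or the marker-synthesis methods, so the sketch assumes plain Pearson correlation between neighboring SNPs, a hypothetical threshold of 0.9, k = 2 neighbors, and a bin mean as the synthetic marker. All function names and the toy genotype matrix are hypothetical.

```python
import numpy as np

def adjacent_abs_corr(geno):
    """|Pearson r| between each SNP column and its right-hand neighbor."""
    centered = geno - geno.mean(axis=0)
    num = (centered[:, :-1] * centered[:, 1:]).sum(axis=0)
    den = np.sqrt((centered[:, :-1] ** 2).sum(axis=0)
                  * (centered[:, 1:] ** 2).sum(axis=0))
    # Monomorphic SNPs have zero variance; treat their correlation as 0.
    safe_den = np.where(den > 0, den, 1.0)
    return np.where(den > 0, np.abs(num / safe_den), 0.0)

def build_bins(geno, threshold=0.9):
    """Group consecutive SNPs; start a new bin when adjacent |r| < threshold.

    Returns a list of (first_snp, last_snp) index pairs, both inclusive.
    """
    bins, start = [], 0
    for j, r in enumerate(adjacent_abs_corr(geno)):
        if r < threshold:          # weak link between SNP j and SNP j + 1
            bins.append((start, j))
            start = j + 1
    bins.append((start, geno.shape[1] - 1))
    return bins

def knn_impute(geno, k=2):
    """Fill np.nan entries from the k most similar individuals (rows).

    Assumes every SNP has at least one observed genotype.
    """
    filled = geno.copy()
    for i, j in zip(*np.where(np.isnan(geno))):
        donors = np.where(~np.isnan(geno[:, j]))[0]
        # Distance: mean absolute genotype difference over co-observed SNPs.
        dist = np.array([np.nanmean(np.abs(geno[i] - geno[d])) for d in donors])
        nearest = donors[np.argsort(dist, kind="stable")[:k]]
        filled[i, j] = np.round(geno[nearest, j].mean())
    return filled

def synthesize(geno, bins):
    """One synthetic marker per bin (assumption: the bin's mean genotype)."""
    return np.column_stack([geno[:, s:e + 1].mean(axis=1) for s, e in bins])

# Toy data: 6 individuals x 5 SNPs, genotypes coded 0/2, one missing value.
geno = np.array([
    [0, 0, 0,      2, 2],
    [2, 2, 2,      0, 0],
    [0, 0, 0,      0, 0],
    [2, 2, np.nan, 0, 0],
    [0, 0, 0,      2, 2],
    [2, 2, 2,      0, 0],
], dtype=float)

filled = knn_impute(geno, k=2)      # impute first, then bin the complete matrix
bins = build_bins(filled, threshold=0.9)
synth = synthesize(filled, bins)
print(bins)
print(synth)
```

Imputing before binning keeps the correlation computation free of missing values. A real pipeline might instead restrict the neighbor search to SNPs within the same LD bin, which this sketch does not attempt.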
format Online
Article
Text
id pubmed-8256826
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-82568262021-07-06 PIP-SNP: a pipeline for processing SNP data featured as linkage disequilibrium bin mapping, genotype imputing and marker synthesizing Zhang, Wenchao Kang, Yun Dai, Xinbin Xu, Shizhong Zhao, Patrick X NAR Genom Bioinform Standard Article Genome-wide association study data analyses often face two significant challenges: (i) high dimensionality of single-nucleotide polymorphism (SNP) genotypes and (ii) imputation of missing values. SNPs are not independent due to physical linkage and natural selection. The correlation of nearby SNPs is known as linkage disequilibrium (LD), which can be used for LD conceptual SNP bin mapping, missing genotype inferencing and SNP dimension reduction. We used a stochastic process to describe the SNP signals and proposed two types of autocorrelations to measure nearby SNPs’ information redundancy. Based on the calculated autocorrelation coefficients, we constructed LD bins. We adopted a k-nearest neighbors algorithm (kNN) to impute the missing genotypes. We proposed several novel methods to find the optimal synthetic marker to represent the SNP bin. We also proposed methods to evaluate the information loss or information conservation between using the original genome-wide markers and using dimension-reduced synthetic markers. Our performance assessments on the real-life SNP data from a rice recombinant inbred line (RIL) population and a rice HapMap project show that the new methods produce satisfactory results. We implemented these functional modules in C/C++ and streamlined them into a web-based pipeline named PIP-SNP (https://bioinfo.noble.org/PIP_SNP/) for processing SNP data. Oxford University Press 2021-07-05 /pmc/articles/PMC8256826/ /pubmed/34235432 http://dx.doi.org/10.1093/nargab/lqab060 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. 
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Standard Article
Zhang, Wenchao
Kang, Yun
Dai, Xinbin
Xu, Shizhong
Zhao, Patrick X
PIP-SNP: a pipeline for processing SNP data featured as linkage disequilibrium bin mapping, genotype imputing and marker synthesizing
title PIP-SNP: a pipeline for processing SNP data featured as linkage disequilibrium bin mapping, genotype imputing and marker synthesizing
topic Standard Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8256826/
https://www.ncbi.nlm.nih.gov/pubmed/34235432
http://dx.doi.org/10.1093/nargab/lqab060