PIP-SNP: a pipeline for processing SNP data featured as linkage disequilibrium bin mapping, genotype imputing and marker synthesizing
Genome-wide association study data analyses often face two significant challenges: (i) high dimensionality of single-nucleotide polymorphism (SNP) genotypes and (ii) imputation of missing values. SNPs are not independent due to physical linkage and natural selection. The correlation of nearby SNPs i...
Main authors: | Zhang, Wenchao; Kang, Yun; Dai, Xinbin; Xu, Shizhong; Zhao, Patrick X |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8256826/ https://www.ncbi.nlm.nih.gov/pubmed/34235432 http://dx.doi.org/10.1093/nargab/lqab060 |
_version_ | 1783718175405244416 |
---|---|
author | Zhang, Wenchao Kang, Yun Dai, Xinbin Xu, Shizhong Zhao, Patrick X |
author_facet | Zhang, Wenchao Kang, Yun Dai, Xinbin Xu, Shizhong Zhao, Patrick X |
author_sort | Zhang, Wenchao |
collection | PubMed |
description | Genome-wide association study data analyses often face two significant challenges: (i) high dimensionality of single-nucleotide polymorphism (SNP) genotypes and (ii) imputation of missing values. SNPs are not independent due to physical linkage and natural selection. The correlation of nearby SNPs is known as linkage disequilibrium (LD), which can be used for LD conceptual SNP bin mapping, missing genotype inferencing and SNP dimension reduction. We used a stochastic process to describe the SNP signals and proposed two types of autocorrelations to measure nearby SNPs’ information redundancy. Based on the calculated autocorrelation coefficients, we constructed LD bins. We adopted a k-nearest neighbors algorithm (kNN) to impute the missing genotypes. We proposed several novel methods to find the optimal synthetic marker to represent the SNP bin. We also proposed methods to evaluate the information loss or information conservation between using the original genome-wide markers and using dimension-reduced synthetic markers. Our performance assessments on the real-life SNP data from a rice recombinant inbred line (RIL) population and a rice HapMap project show that the new methods produce satisfactory results. We implemented these functional modules in C/C++ and streamlined them into a web-based pipeline named PIP-SNP (https://bioinfo.noble.org/PIP_SNP/) for processing SNP data. |
format | Online Article Text |
id | pubmed-8256826 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-82568262021-07-06 PIP-SNP: a pipeline for processing SNP data featured as linkage disequilibrium bin mapping, genotype imputing and marker synthesizing Zhang, Wenchao Kang, Yun Dai, Xinbin Xu, Shizhong Zhao, Patrick X NAR Genom Bioinform Standard Article Genome-wide association study data analyses often face two significant challenges: (i) high dimensionality of single-nucleotide polymorphism (SNP) genotypes and (ii) imputation of missing values. SNPs are not independent due to physical linkage and natural selection. The correlation of nearby SNPs is known as linkage disequilibrium (LD), which can be used for LD conceptual SNP bin mapping, missing genotype inferencing and SNP dimension reduction. We used a stochastic process to describe the SNP signals and proposed two types of autocorrelations to measure nearby SNPs’ information redundancy. Based on the calculated autocorrelation coefficients, we constructed LD bins. We adopted a k-nearest neighbors algorithm (kNN) to impute the missing genotypes. We proposed several novel methods to find the optimal synthetic marker to represent the SNP bin. We also proposed methods to evaluate the information loss or information conservation between using the original genome-wide markers and using dimension-reduced synthetic markers. Our performance assessments on the real-life SNP data from a rice recombinant inbred line (RIL) population and a rice HapMap project show that the new methods produce satisfactory results. We implemented these functional modules in C/C++ and streamlined them into a web-based pipeline named PIP-SNP (https://bioinfo.noble.org/PIP_SNP/) for processing SNP data. Oxford University Press 2021-07-05 /pmc/articles/PMC8256826/ /pubmed/34235432 http://dx.doi.org/10.1093/nargab/lqab060 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. 
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Standard Article Zhang, Wenchao Kang, Yun Dai, Xinbin Xu, Shizhong Zhao, Patrick X PIP-SNP: a pipeline for processing SNP data featured as linkage disequilibrium bin mapping, genotype imputing and marker synthesizing |
title | PIP-SNP: a pipeline for processing SNP data featured as linkage disequilibrium bin mapping, genotype imputing and marker synthesizing |
topic | Standard Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8256826/ https://www.ncbi.nlm.nih.gov/pubmed/34235432 http://dx.doi.org/10.1093/nargab/lqab060 |
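
The description above states that missing genotypes are imputed with a k-nearest neighbors algorithm (kNN). The record does not say how PIP-SNP measures distance or chooses k, so the following is only a minimal sketch of the general idea, not the authors' implementation. The names `knn_impute` and `mismatch_rate`, the use of `None` for a missing call, the 0/2 genotype coding, the mismatch-rate distance, and the default `k=3` are all assumptions made for illustration.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing genotype call


def mismatch_rate(a, b):
    """Fraction of mismatching calls over SNPs observed in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return None  # no overlap, so the two samples cannot be compared
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(genotypes, k=3):
    """Return a copy of a samples-by-SNPs matrix with missing calls filled.

    Each missing call is replaced by the majority call among the k most
    similar samples that have an observed call at that SNP. Calls that
    cannot be imputed (no comparable neighbor) are left missing.
    """
    imputed = [row[:] for row in genotypes]
    for i, row in enumerate(genotypes):
        for j, call in enumerate(row):
            if call is not MISSING:
                continue
            candidates = []
            for idx, other in enumerate(genotypes):
                if idx == i or other[j] is MISSING:
                    continue
                dist = mismatch_rate(row, other)
                if dist is not None:
                    candidates.append((dist, idx))
            if not candidates:
                continue
            candidates.sort()  # nearest first; ties broken by sample index
            votes = Counter(genotypes[idx][j] for _, idx in candidates[:k])
            imputed[i][j] = votes.most_common(1)[0][0]
    return imputed


# Toy data: 5 samples, 5 SNPs, homozygous calls coded 0 or 2 (assumed coding).
toy = [
    [0, 0, 2, 2, 0],
    [0, 0, 2, MISSING, 0],
    [2, 2, 0, 0, 2],
    [0, 0, 2, 2, 0],
    [2, 2, 0, 0, 2],
]
filled = knn_impute(toy, k=3)
```

In this toy matrix the second sample matches the first and fourth samples at every shared SNP, so its missing call is filled from their votes. Distances are computed on the original matrix rather than on partially imputed rows, which keeps the result independent of the order in which missing calls are visited.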