
An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data

Next-generation sequencing is a powerful approach for discovering genetic variation. Sensitive variant calling and haplotype inference from population sequencing data remain challenging. We describe methods for high-quality discovery, genotyping, and phasing of SNPs for low-coverage (approximately 5...


Bibliographic Details
Main Authors: Wang, Yi; Lu, James; Yu, Jin; Gibbs, Richard A.; Yu, Fuli
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3638139/
https://www.ncbi.nlm.nih.gov/pubmed/23296920
http://dx.doi.org/10.1101/gr.146084.112
author Wang, Yi
Lu, James
Yu, Jin
Gibbs, Richard A.
Yu, Fuli
collection PubMed
description Next-generation sequencing is a powerful approach for discovering genetic variation. Sensitive variant calling and haplotype inference from population sequencing data remain challenging. We describe methods for high-quality discovery, genotyping, and phasing of SNPs for low-coverage (approximately 5×) sequencing of populations, implemented in a pipeline called SNPTools. Our pipeline contains several innovations that specifically address challenges caused by low-coverage population sequencing: (1) effective base depth (EBD), a nonparametric statistic that enables more accurate statistical modeling of sequencing data; (2) variance ratio scoring, a variance-based statistic that discovers polymorphic loci with high sensitivity and specificity; and (3) BAM-specific binomial mixture modeling (BBMM), a clustering algorithm that generates robust genotype likelihoods from heterogeneous sequencing data. Last, we develop an imputation engine that refines raw genotype likelihoods to produce high-quality phased genotypes/haplotypes. Designed for large population studies, SNPTools' input/output (I/O) and storage aware design leads to improved computing performance on large sequencing data sets. We apply SNPTools to the International 1000 Genomes Project (1000G) Phase 1 low-coverage data set and obtain genotyping accuracy comparable to that of SNP microarray.
format Online
Article
Text
id pubmed-3638139
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-3638139 2013-11-01 An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data Wang, Yi Lu, James Yu, Jin Gibbs, Richard A. Yu, Fuli Genome Res Method
Cold Spring Harbor Laboratory Press 2013-05 /pmc/articles/PMC3638139/ /pubmed/23296920 http://dx.doi.org/10.1101/gr.146084.112 Text en © 2013, Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/3.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 3.0 Unported License), as described at http://creativecommons.org/licenses/by-nc/3.0/.
title An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data
topic Method
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3638139/
https://www.ncbi.nlm.nih.gov/pubmed/23296920
http://dx.doi.org/10.1101/gr.146084.112