An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data

The analysis of next-generation sequencing data is computationally and statistically challenging because of the massive volume of data and imperfect data quality. We present GotCloud, a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data. GotCloud...

Bibliographic Details
Main Authors:	Jun, Goo; Wing, Mary Kate; Abecasis, Gonçalo R.; Kang, Hyun Min
Format:	Online Article Text
Language:	English
Published:	Cold Spring Harbor Laboratory Press 2015
Subjects:	Method
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4448687/ https://www.ncbi.nlm.nih.gov/pubmed/25883319 http://dx.doi.org/10.1101/gr.176552.114

collection	PubMed
description	The analysis of next-generation sequencing data is computationally and statistically challenging because of the massive volume of data and imperfect data quality. We present GotCloud, a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data. GotCloud automates sequence alignment, sample-level quality control, variant calling, filtering of likely artifacts using machine-learning techniques, and genotype refinement using haplotype information. The pipeline can process thousands of samples in parallel and requires less computational resources than current alternatives. Experiments with whole-genome and exome-targeted sequence data generated by the 1000 Genomes Project show that the pipeline provides effective filtering against false positive variants and high power to detect true variants. Our pipeline has already contributed to variant detection and genotyping in several large-scale sequencing projects, including the 1000 Genomes Project and the NHLBI Exome Sequencing Project. We hope it will now prove useful to many medical sequencing studies.
format	Online Article Text
id	pubmed-4448687
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Cold Spring Harbor Laboratory Press
record_format	MEDLINE/PubMed
spelling	pubmed-4448687 2015-12-01 Genome Res Method Cold Spring Harbor Laboratory Press 2015-06 /pmc/articles/PMC4448687/ /pubmed/25883319 http://dx.doi.org/10.1101/gr.176552.114 Text en © 2015 Jun et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
title	An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data
topic	Method
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4448687/ https://www.ncbi.nlm.nih.gov/pubmed/25883319 http://dx.doi.org/10.1101/gr.176552.114

Similar Items