Validation and assessment of variant calling pipelines for next-generation sequencing

BACKGROUND: The processing and analysis of the large-scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. Here, we val...

Bibliographic Details
Main authors:	Pirooznia, Mehdi; Kramer, Melissa; Parla, Jennifer; Goes, Fernando S; Potash, James B; McCombie, W Richard; Zandi, Peter P
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Primary Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4129436/ https://www.ncbi.nlm.nih.gov/pubmed/25078893 http://dx.doi.org/10.1186/1479-7364-8-14

author	Pirooznia, Mehdi; Kramer, Melissa; Parla, Jennifer; Goes, Fernando S; Potash, James B; McCombie, W Richard; Zandi, Peter P
author_sort	Pirooznia, Mehdi
collection	PubMed
description	BACKGROUND: The processing and analysis of the large-scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. Here, we validate the variant calling of these tools and compare their relative accuracy to determine which data processing pipeline is optimal. RESULTS: We developed a unified pipeline for processing NGS data that encompasses four modules: mapping, filtering, realignment and recalibration, and variant calling. We processed 130 subjects from an ongoing whole exome sequencing study through this pipeline. To evaluate the accuracy of each module, we conducted a series of comparisons between the single nucleotide variant (SNV) calls from the NGS data and either gold-standard Sanger sequencing on a total of 700 variants or array genotyping data on a total of 9,935 single-nucleotide polymorphisms. A head-to-head comparison showed that the Genome Analysis Toolkit (GATK) provided more accurate calls than SAMtools (positive predictive value of 92.55% vs. 80.35%, respectively). Realignment of mapped reads and recalibration of base quality scores before SNV calling proved to be crucial to accurate variant calling. The GATK HaplotypeCaller algorithm for variant calling outperformed the UnifiedGenotyper algorithm. We also showed that mapping quality, read depth, and allele balance are each related to SNV call accuracy. However, if best practices are used in data processing, then additional filtering based on these metrics provides little gain, and accuracies of >99% are achievable. CONCLUSIONS: Our findings will help to determine the best approach for processing NGS data to confidently call variants for downstream analyses. To enable others to implement and replicate our results, all of our code is freely available at http://metamoodics.org/wes.
format	Online Article Text
id	pubmed-4129436
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4129436 2014-08-13 Validation and assessment of variant calling pipelines for next-generation sequencing. Pirooznia, Mehdi; Kramer, Melissa; Parla, Jennifer; Goes, Fernando S; Potash, James B; McCombie, W Richard; Zandi, Peter P. Hum Genomics. Primary Research. BioMed Central 2014-07-30. /pmc/articles/PMC4129436/ /pubmed/25078893 http://dx.doi.org/10.1186/1479-7364-8-14 Text en. Copyright © 2014 Pirooznia et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Validation and assessment of variant calling pipelines for next-generation sequencing
topic	Primary Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4129436/ https://www.ncbi.nlm.nih.gov/pubmed/25078893 http://dx.doi.org/10.1186/1479-7364-8-14
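
The positive predictive values quoted in the description (92.55% for GATK vs. 80.35% for SAMtools) come from comparing pipeline calls against validation genotypes. The record does not give the authors' exact procedure, so the following is only a minimal Python sketch under stated assumptions: PPV is taken as TP / (TP + FP) over sites that were both called and validated, a call counts as a true positive when its genotype matches the validated genotype, and the function name, site keys, and genotypes are hypothetical illustration data, not values from the study.

```python
def positive_predictive_value(calls, truth):
    """Return TP / (TP + FP) for variant calls checked against validation data.

    calls: dict mapping a site key, e.g. (chrom, pos), to the called genotype
    truth: dict mapping a site key to the validated genotype
           (e.g. from Sanger sequencing or array genotyping)
    Only sites present in both dicts are evaluated (an assumption of this sketch).
    """
    evaluated = [site for site in calls if site in truth]
    if not evaluated:
        raise ValueError("no called sites overlap the validation set")
    # A call is a true positive when validation confirms the same genotype;
    # every other evaluated call is counted as a false positive.
    true_pos = sum(1 for site in evaluated if calls[site] == truth[site])
    return true_pos / len(evaluated)


# Hypothetical toy data: three called sites are validated, two are confirmed.
calls = {
    ("chr1", 100): "0/1",
    ("chr1", 200): "1/1",
    ("chr2", 50): "0/1",
    ("chr3", 75): "0/1",   # called but never validated, so not evaluated
}
truth = {
    ("chr1", 100): "0/1",
    ("chr1", 200): "1/1",
    ("chr2", 50): "0/0",   # validation found no variant: false positive
    ("chr4", 10): "0/1",   # validated but not called, irrelevant to PPV
}

ppv = positive_predictive_value(calls, truth)
print(f"PPV = {ppv:.2%}")
```

Restricting the calculation to sites with validation data keeps unvalidated calls from being counted as errors, but it also means the estimate is only as representative as the validated subset.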

Similar Items