Benchmarking variant identification tools for plant diversity discovery

BACKGROUND: The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing. Most bioinformatics tools for processing next-generation sequencing data were originally developed and tested in human studies, raising questions...

Bibliographic Details
Main Authors:	Wu, Xing, Heffelfinger, Christopher, Zhao, Hongyu, Dellaporta, Stephen L.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6734213/ https://www.ncbi.nlm.nih.gov/pubmed/31500583 http://dx.doi.org/10.1186/s12864-019-6057-7

author	Wu, Xing Heffelfinger, Christopher Zhao, Hongyu Dellaporta, Stephen L.
collection	PubMed
description	BACKGROUND: The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing. Most bioinformatics tools for processing next-generation sequencing data were originally developed and tested in human studies, raising questions as to their efficacy for plant research. A detailed evaluation of the entire variant calling pipeline, including alignment, variant calling, variant filtering, and imputation, was performed on different programs using both simulated and real plant genomic datasets. RESULTS: A comparison of SOAP2, Bowtie2, and BWA-MEM found that BWA-MEM was consistently able to align the most reads with high accuracy, whereas Bowtie2 had the highest overall accuracy. Comparative results of GATK HaplotypeCaller versus SAMtools mpileup indicated that the choice of variant caller affected precision and recall differentially depending on the levels of diversity, sequence coverage, and genome complexity. A cross-reference experiment of S. lycopersicum and S. pennellii reference genomes revealed the inadequacy of a single reference genome for variant discovery that includes distantly related plant individuals. A machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff strategy, resulting in a higher number of true positive variants and fewer false positive variants. A 2-step imputation method, which utilized a set of high-confidence SNPs as the reference panel, showed up to 60% higher accuracy than direct LD-based imputation. CONCLUSIONS: Programs in the variant discovery pipeline perform differently on plant genomic datasets. The choice of programs depends on the goal of the study and the available resources. This study provides important guidance for plant biologists utilizing next-generation sequencing data for diversity characterization and crop improvement.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-019-6057-7) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6734213
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6734213 2019-09-12 Benchmarking variant identification tools for plant diversity discovery Wu, Xing Heffelfinger, Christopher Zhao, Hongyu Dellaporta, Stephen L. BMC Genomics Research Article BioMed Central 2019-09-09 /pmc/articles/PMC6734213/ /pubmed/31500583 http://dx.doi.org/10.1186/s12864-019-6057-7 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Benchmarking variant identification tools for plant diversity discovery
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6734213/ https://www.ncbi.nlm.nih.gov/pubmed/31500583 http://dx.doi.org/10.1186/s12864-019-6057-7
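
The abstract compares variant callers by precision and recall. As a rough illustration of what those two measures mean for a call set, here is a minimal Python sketch. The `precision_recall` helper, the `(chrom, pos, ref, alt)` tuple representation, and the toy variants are all assumptions made for this example; they are not taken from the study, whose evaluation procedure is not described in this record.

```python
from typing import Iterable, Tuple

# Hypothetical representation of a variant: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def precision_recall(called: Iterable[Variant], truth: Iterable[Variant]) -> Tuple[float, float]:
    """Compare a call set against a truth set by exact variant match."""
    called_set = set(called)
    truth_set = set(truth)

    true_pos = len(called_set & truth_set)   # called and present in truth
    false_pos = len(called_set - truth_set)  # called but absent from truth
    false_neg = len(truth_set - called_set)  # in truth but not called

    # Guard against empty sets to avoid division by zero.
    precision = true_pos / (true_pos + false_pos) if called_set else 0.0
    recall = true_pos / (true_pos + false_neg) if truth_set else 0.0
    return precision, recall


# Toy data, invented for illustration only.
truth = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr2", 90, "T", "C"),
]
called = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 55, "G", "C"),
]

precision, recall = precision_recall(called, truth)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Exact matching on position and alleles is the simplest possible comparison; real benchmarks often normalize variant representation first, which this sketch does not attempt.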