An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome

BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are widely used molecular markers, and their use has increased massively since the inception of Next Generation Sequencing (NGS) technologies, which allow detection of large numbers of SNPs at low cost. However, both NGS data and their analysis are...

Bibliographic Details
Main authors:	Ribeiro, Antonio; Golicz, Agnieszka; Hackett, Christine Anne; Milne, Iain; Stephen, Gordon; Marshall, David; Flavell, Andrew J.; Bayer, Micha
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4642669/ https://www.ncbi.nlm.nih.gov/pubmed/26558718 http://dx.doi.org/10.1186/s12859-015-0801-z

author	Ribeiro, Antonio; Golicz, Agnieszka; Hackett, Christine Anne; Milne, Iain; Stephen, Gordon; Marshall, David; Flavell, Andrew J.; Bayer, Micha
collection	PubMed
description	BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are widely used molecular markers, and their use has increased massively since the inception of Next Generation Sequencing (NGS) technologies, which allow detection of large numbers of SNPs at low cost. However, both NGS data and their analysis are error-prone, which can lead to the generation of false positive (FP) SNPs. We explored the relationship between FP SNPs and seven factors involved in mapping-based variant calling — quality of the reference sequence, read length, choice of mapper and variant caller, mapping stringency and filtering of SNPs by read mapping quality and read depth. This resulted in 576 possible factor level combinations. We used error- and variant-free simulated reads to ensure that every SNP found was indeed a false positive. RESULTS: The variation in the number of FP SNPs generated ranged from 0 to 36,621 for the 120 million base pairs (Mbp) genome. All of the experimental factors tested had statistically significant effects on the number of FP SNPs generated and there was a considerable amount of interaction between the different factors. Using a fragmented reference sequence led to a dramatic increase in the number of FP SNPs generated, as did relaxed read mapping and a lack of SNP filtering. The choice of reference assembler, mapper and variant caller also significantly affected the outcome. The effect of read length was more complex and suggests a possible interaction between mapping specificity and the potential for contributing more false positives as read length increases. CONCLUSIONS: The choice of tools and parameters involved in variant calling can have a dramatic effect on the number of FP SNPs produced, with particularly poor combinations of software and/or parameter settings yielding tens of thousands in this experiment. 
Between-factor interactions make simple recommendations difficult for a SNP discovery pipeline but the quality of the reference sequence is clearly of paramount importance. Our findings are also a stark reminder that it can be unwise to use the relaxed mismatch settings provided as defaults by some read mappers when reads are being mapped to a relatively unfinished reference sequence from e.g. a non-model organism in its early stages of genomic exploration. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0801-z) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4642669
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4642669 2015-11-13 BMC Bioinformatics Research Article BioMed Central 2015-11-11 /pmc/articles/PMC4642669/ /pubmed/26558718 http://dx.doi.org/10.1186/s12859-015-0801-z Text en © Ribeiro et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4642669/ https://www.ncbi.nlm.nih.gov/pubmed/26558718 http://dx.doi.org/10.1186/s12859-015-0801-z
