
Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism–calling pipelines



Bibliographic Details
Main authors: Bush, Stephen J, Foster, Dona, Eyre, David W, Clark, Emily L, De Maio, Nicola, Shaw, Liam P, Stoesser, Nicole, Peto, Tim E A, Crook, Derrick W, Walker, A Sarah
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7002876/
https://www.ncbi.nlm.nih.gov/pubmed/32025702
http://dx.doi.org/10.1093/gigascience/giaa007
collection PubMed
description BACKGROUND: Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella. RESULTS: We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis. CONCLUSIONS: The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.
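The abstract reports pipeline sensitivity and precision against the Mash distance between reads and reference genome. As a rough illustration only (not the authors' evaluation code), the sketch below computes precision and sensitivity from two sets of SNP calls and converts a Jaccard index into a Mash distance using the published Mash formula, D = -(1/k) ln(2j / (1 + j)). The function names, the (position, base) representation of a SNP, and the example values are assumptions made for this sketch; k = 21 is the Mash default k-mer size.

```python
import math


def snp_metrics(called, truth):
    """Return (precision, sensitivity) for two collections of SNP calls.

    Each SNP is assumed to be a hashable (position, alt_base) tuple;
    this representation is hypothetical, chosen for illustration.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # calls that match the truth set
    fp = len(called - truth)   # calls absent from the truth set
    fn = len(truth - called)   # true SNPs that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, sensitivity


def mash_distance(jaccard, k=21):
    """Convert a Jaccard index estimate into a Mash distance.

    Uses D = -(1/k) * ln(2j / (1 + j)); D is 1 when no k-mers are
    shared and 0 when the k-mer sets are identical.
    """
    if jaccard <= 0:
        return 1.0
    if jaccard >= 1:
        return 0.0
    return -math.log(2 * jaccard / (1 + jaccard)) / k


# Hypothetical example: four calls, three true SNPs, two in common.
called = {(10, "A"), (20, "T"), (30, "G"), (40, "C")}
truth = {(10, "A"), (20, "T"), (50, "G")}
precision, sensitivity = snp_metrics(called, truth)
print(f"precision={precision:.3f} sensitivity={sensitivity:.3f}")
print(f"Mash distance at j=0.5: {mash_distance(0.5):.4f}")
```

Under these assumptions, repeating the calculation for reads aligned to progressively more distant references would give the kind of sensitivity and precision versus distance comparison the abstract summarizes.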
format Online
Article
Text
id pubmed-7002876
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7002876 2020-02-10 Gigascience Research Oxford University Press 2020-02-06 /pmc/articles/PMC7002876/ /pubmed/32025702 http://dx.doi.org/10.1093/gigascience/giaa007 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism–calling pipelines
topic Research