
Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses

Infection with human cytomegalovirus (HCMV) can cause severe complications in immunocompromised individuals and congenitally infected children. Characterizing heterogeneous viral populations and their evolution by high-throughput sequencing of clinical specimens requires the accurate assembly of ind...


Bibliographic Details
Main authors: Deng, Zhi-Luo, Dhingra, Akshay, Fritz, Adrian, Götting, Jasper, Münch, Philipp C, Steinbrück, Lars, Schulz, Thomas F, Ganzenmüller, Tina, McHardy, Alice C
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects: Problem Solving Protocol
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8138829/
https://www.ncbi.nlm.nih.gov/pubmed/34020538
http://dx.doi.org/10.1093/bib/bbaa123
author Deng, Zhi-Luo
Dhingra, Akshay
Fritz, Adrian
Götting, Jasper
Münch, Philipp C
Steinbrück, Lars
Schulz, Thomas F
Ganzenmüller, Tina
McHardy, Alice C
author_sort Deng, Zhi-Luo
collection PubMed
description Infection with human cytomegalovirus (HCMV) can cause severe complications in immunocompromised individuals and congenitally infected children. Characterizing heterogeneous viral populations and their evolution by high-throughput sequencing of clinical specimens requires the accurate assembly of individual strains or sequence variants and suitable variant calling methods. However, the performance of most methods has not been assessed for populations composed of low divergent viral strains with large genomes, such as HCMV. In an extensive benchmarking study, we evaluated 15 assemblers and 6 variant callers on 10 lab-generated benchmark data sets created with two different library preparation protocols, to identify best practices and challenges for analyzing such data. Most assemblers, especially metaSPAdes and IVA, performed well across a range of metrics in recovering abundant strains. However, only one, Savage, recovered low abundant strains and in a highly fragmented manner. Two variant callers, LoFreq and VarScan2, excelled across all strain abundances. Both shared a large fraction of false positive variant calls, which were strongly enriched in T to G changes in a ‘G.G’ context. The magnitude of this context-dependent systematic error is linked to the experimental protocol. We provide all benchmarking data, results and the entire benchmarking workflow named QuasiModo, Quasispecies Metric determination on omics, under the GNU General Public License v3.0 (https://github.com/hzi-bifo/Quasimodo), to enable full reproducibility and further benchmarking on these and other data.
format Online
Article
Text
id pubmed-8138829
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8138829 2021-05-25 Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses Deng, Zhi-Luo Dhingra, Akshay Fritz, Adrian Götting, Jasper Münch, Philipp C Steinbrück, Lars Schulz, Thomas F Ganzenmüller, Tina McHardy, Alice C Brief Bioinform Problem Solving Protocol
Oxford University Press 2020-07-07 /pmc/articles/PMC8138829/ /pubmed/34020538 http://dx.doi.org/10.1093/bib/bbaa123 Text en © The Author(s) 2020. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses
topic Problem Solving Protocol
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8138829/
https://www.ncbi.nlm.nih.gov/pubmed/34020538
http://dx.doi.org/10.1093/bib/bbaa123