
Comprehensive and accurate genetic variant identification from contaminated and low-coverage Mycobacterium tuberculosis whole genome sequencing data

An improved understanding of the genomic variants that allow Mycobacterium tuberculosis (Mtb) to acquire drug resistance, or tolerance, and to increase its virulence is an important factor in controlling the current tuberculosis epidemic. Current approaches to Mtb sequencing, however, cannot reveal Mtb’s f...


Bibliographic Details
Main authors: Heupink, Tim H., Verboven, Lennert, Warren, Robin M., Van Rie, Annelies
Format: Online Article Text
Language: English
Published: Microbiology Society 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8743552/
https://www.ncbi.nlm.nih.gov/pubmed/34793294
http://dx.doi.org/10.1099/mgen.0.000689
author Heupink, Tim H.
Verboven, Lennert
Warren, Robin M.
Van Rie, Annelies
collection PubMed
description An improved understanding of the genomic variants that allow Mycobacterium tuberculosis (Mtb) to acquire drug resistance, or tolerance, and to increase its virulence is an important factor in controlling the current tuberculosis epidemic. Current approaches to Mtb sequencing, however, cannot reveal Mtb’s full genomic diversity due to the strict requirements of low contamination levels, high Mtb sequence coverage and elimination of complex regions. We have developed the XBS (compleX Bacterial Samples) bioinformatics pipeline, which implements joint calling and machine-learning-based variant filtering tools to specifically improve variant detection in the important Mtb samples that do not meet these criteria, such as those from unbiased sputum samples. Using novel simulated datasets, which permit exact accuracy verification, XBS was compared to the UVP and MTBseq pipelines. Accuracy statistics showed that all three pipelines performed equally well for sequence data that resemble those obtained from culture isolates of high depth of coverage and low-level contamination. In the complex genomic regions, however, XBS accurately identified 9.0 % more SNPs and 8.1 % more single nucleotide insertions and deletions than the WHO-endorsed unified analysis variant pipeline. XBS also had superior accuracy for sequence data that resemble those obtained directly from sputum samples, where depth of coverage is typically very low and contamination levels are high. XBS was the only pipeline not affected by low depth of coverage (5–10×), type of contamination and excessive contamination levels (>50 %).
Simulation results were confirmed using whole genome sequencing (WGS) data from clinical samples, which showed the superior performance of XBS, with a higher sensitivity (98.8 %) when analysing culture isolates and the identification of 13.9 % more variable sites in WGS data from sputum samples compared with MTBseq, without evidence of false positive variants when rRNA regions were excluded. The XBS pipeline facilitates sequencing of less-than-perfect Mtb samples. These advances will benefit future clinical applications of Mtb sequencing, especially WGS directly from clinical specimens, thereby avoiding in vitro biases and making many more samples available for drug resistance and other genomic analyses. The additional genetic resolution and increased sample success rate will improve genome-wide association studies and sequence-based transmission studies.
format Online
Article
Text
id pubmed-8743552
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Microbiology Society
record_format MEDLINE/PubMed
spelling pubmed-8743552 2022-01-10 Microb Genom Research Articles Microbiology Society 2021-11-18 /pmc/articles/PMC8743552/ /pubmed/34793294 http://dx.doi.org/10.1099/mgen.0.000689 Text en © 2021 The Authors https://creativecommons.org/licenses/by-nc/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution NonCommercial License.
title Comprehensive and accurate genetic variant identification from contaminated and low-coverage Mycobacterium tuberculosis whole genome sequencing data
topic Research Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8743552/
https://www.ncbi.nlm.nih.gov/pubmed/34793294
http://dx.doi.org/10.1099/mgen.0.000689