A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples


Bibliographic Details
Main authors: Cuevas-Córdoba, Betzaida, Fresno, Cristóbal, Haase-Hernández, Joshua I., Barbosa-Amezcua, Martín, Mata-Rocha, Minerva, Muñoz-Torrico, Marcela, Salazar-Lezama, Miguel A., Martínez-Orozco, José A., Narváez-Díaz, Luis A., Salas-Hernández, Jorge, González-Covarrubias, Vanessa, Soberón, Xavier
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8547644/
https://www.ncbi.nlm.nih.gov/pubmed/34699523
http://dx.doi.org/10.1371/journal.pone.0258774
author Cuevas-Córdoba, Betzaida
Fresno, Cristóbal
Haase-Hernández, Joshua I.
Barbosa-Amezcua, Martín
Mata-Rocha, Minerva
Muñoz-Torrico, Marcela
Salazar-Lezama, Miguel A.
Martínez-Orozco, José A.
Narváez-Díaz, Luis A.
Salas-Hernández, Jorge
González-Covarrubias, Vanessa
Soberón, Xavier
collection PubMed
description Next-Generation Sequencing (NGS) is widely used to investigate genomic variation. In several studies, the genetic variation of Mycobacterium tuberculosis has been analyzed in sputum samples without previous culture, using target enrichment methodologies for NGS. Alignment programs are generally run under default parameters, and it is assumed from these results that only Mycobacterium reads will be obtained. However, in clinical samples, variants of the microorganism of interest can be confused with a vast collection of reads from other bacteria, viruses, and human DNA. Currently, there are no standardized pipelines, and the cleaning success is never verified, since rigorous controls to identify and remove reads from other sputum microorganisms genetically similar to M. tuberculosis are lacking. Therefore, we designed a bioinformatic pipeline to process NGS data from sputum samples, including several filters and quality control points that identify and eliminate non-M. tuberculosis reads to obtain a reliable genetic variant report. Our proposal uses the SURPI software as a taxonomic classifier to filter input sequences and performs a mapping that provides the highest percentage of Mycobacterium reads, minimizing the reads from other microorganisms. We then use the filtered sequences to perform variant calling with the GATK software, applying mapping-quality checks, realignment, recalibration, hard-filtering, and post-filtering to increase the reliability of the reported variants. Using default mapping parameters, we identified reads of contaminant bacteria, such as Streptococcus, Rothia, Actinomyces, and Veillonella. Our final mapping strategy allowed a sequence identity of 97.8% between the input reads and the whole M. tuberculosis reference genome H37Rv using a genomic edit distance of three, thus removing 98.8% of the off-target sequences with a loss of 1.7% of Mycobacterium reads. Finally, more than 200 unreliable genetic variants were removed during variant calling, increasing the report's reliability.
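The edit-distance criterion mentioned in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the tool or options used for this step. The sketch assumes alignments are available as SAM text lines carrying the standard `NM:i:` tag (edit distance to the reference), and the function names `edit_distance`, `is_unmapped`, and `filter_by_edit_distance` are hypothetical. Only the threshold of three comes from the record.

```python
from typing import Iterable, Iterator, Optional

# Threshold taken from the abstract ("a genomic edit distance of three").
MAX_EDIT_DISTANCE = 3


def edit_distance(sam_line: str) -> Optional[int]:
    """Return the NM:i value of a SAM alignment line, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # the 11 mandatory columns come first; optional tags follow
        if tag.startswith("NM:i:"):
            return int(tag[len("NM:i:"):])
    return None


def is_unmapped(sam_line: str) -> bool:
    """True when bit 0x4 (segment unmapped) is set in the SAM FLAG column."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & 0x4)


def filter_by_edit_distance(
    lines: Iterable[str], max_nm: int = MAX_EDIT_DISTANCE
) -> Iterator[str]:
    """Yield header lines plus mapped alignments whose NM is at most max_nm."""
    for line in lines:
        if line.startswith("@"):  # SAM header lines pass through unchanged
            yield line
            continue
        if is_unmapped(line):
            continue
        nm = edit_distance(line)
        # Alignments without an NM tag are dropped (an assumption of this sketch).
        if nm is not None and nm <= max_nm:
            yield line
```

Passing the lines of a SAM file through `filter_by_edit_distance` keeps only reads that differ from the reference by at most three edits. In a real workflow the same idea would more likely be expressed through the aligner's or a SAM toolkit's own filtering options.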
format Online
Article
Text
id pubmed-8547644
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8547644 2021-10-27 PLoS One Research Article Public Library of Science 2021-10-26 /pmc/articles/PMC8547644/ /pubmed/34699523 http://dx.doi.org/10.1371/journal.pone.0258774 Text en
© 2021 Cuevas-Córdoba et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8547644/
https://www.ncbi.nlm.nih.gov/pubmed/34699523
http://dx.doi.org/10.1371/journal.pone.0258774