
Zebra: Static and Dynamic Genome Cover Thresholds with Overlapping References

Assigning taxonomy remains a challenging topic in microbiome studies, due largely to ambiguity of reads which overlap multiple reference genomes. With the Web of Life (WoL) reference database hosting 10,575 reference genomes and growing, the percentage of ambiguous reads will only increase. The resu...

Bibliographic Details
Main authors: Hakim, Daniel; Wandro, Stephen; Zengler, Karsten; Zaramela, Livia S.; Nowinski, Brent; Swafford, Austin; Zhu, Qiyun; Song, Se Jin; Gonzalez, Antonio; McDonald, Daniel; Knight, Rob
Format: Online Article Text
Language: English
Published: American Society for Microbiology 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9600373/
https://www.ncbi.nlm.nih.gov/pubmed/36073806
http://dx.doi.org/10.1128/msystems.00758-22
author Hakim, Daniel
Wandro, Stephen
Zengler, Karsten
Zaramela, Livia S.
Nowinski, Brent
Swafford, Austin
Zhu, Qiyun
Song, Se Jin
Gonzalez, Antonio
McDonald, Daniel
Knight, Rob
collection PubMed
description Assigning taxonomy remains a challenging topic in microbiome studies, due largely to ambiguity of reads which overlap multiple reference genomes. With the Web of Life (WoL) reference database hosting 10,575 reference genomes and growing, the percentage of ambiguous reads will only increase. The resulting artifacts create both the illusion of co-occurrence and a long tail end of extraneous reference hits that confound interpretation. We introduce genome cover, the fraction of reference genome overlapped by reads, to distinguish these artifacts. We show how to dynamically predict genome cover by read count and examine our model in Staphylococcus aureus monoculture. Our modeling cleanly separates both S. aureus and true contaminants from the false artifacts of reference overlap. We next introduce saturated genome cover, the true fraction of a reference genome overlapped by sample contents. Genome cover may not saturate for low abundance or low prevalence bacteria. We assuage this worry with examination of a large human fecal data set. By compositing the metric across like samples, genome cover saturates even for rare species. We note that it is a threshold on saturated genome cover, not genome cover itself, which indicates a spurious reference hit or distant relative. We present Zebra, a method to compute and threshold the genome cover metric across like samples, a recurrence to estimate genome cover and confirm saturation, and provide guidance for choosing cover thresholds in real world scenarios. Standalone genome cover and integration into Woltka are available: https://github.com/biocore/zebra_filter, https://github.com/qiyunzhu/woltka.
IMPORTANCE Taxonomic assignment, assigning sequences to specific taxonomic units, is a crucial processing step in microbiome analyses. Issues in taxonomic assignment affect interpretation of what microbes are present in each sample and may be associated with specific environmental or clinical conditions. Assigning importance to a particular taxon relies strongly on independence of assigned counts. The false inclusion of thousands of correlated taxa makes interpretation ambiguous, leading to underconstrained results which cannot be reproduced. The importance sometimes attached to implausible artifacts such as anthrax or bubonic plague is especially problematic. We show that the Zebra filter retrieves only the nearest relatives of sample contents enabling more reproducible and biologically plausible interpretation of metagenomic data.
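The genome cover metric defined in the description (the fraction of a reference genome overlapped by reads) and the idea of compositing it across like samples can be illustrated with a minimal Python sketch. This is not the zebra_filter or Woltka implementation: the function names are hypothetical, and the sketch assumes that read alignments have already been reduced to half-open `(start, end)` intervals on a single reference of known length.

```python
from typing import Dict, Iterable, List, Tuple

# Assumption: each aligned read is a half-open interval [start, end)
# in reference coordinates. Names below are illustrative only.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into disjoint ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of counting bases twice.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def genome_cover(intervals: Iterable[Interval], genome_length: int) -> float:
    """Fraction of the reference overlapped by at least one read."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


def composite_cover(per_sample: Dict[str, List[Interval]],
                    genome_length: int) -> float:
    """Genome cover after pooling read intervals from several like samples."""
    pooled = [iv for ivs in per_sample.values() for iv in ivs]
    return genome_cover(pooled, genome_length)


if __name__ == "__main__":
    samples = {
        "sample_a": [(0, 100), (50, 150)],  # overlapping reads: 150 bases
        "sample_b": [(400, 500)],           # a different region: 100 bases
    }
    for name, ivs in samples.items():
        print(name, genome_cover(ivs, 1000))       # 0.15, then 0.1
    print("composite", composite_cover(samples, 1000))  # 0.25
```

In this toy input the composite cover exceeds the cover seen in either sample alone, which is the effect the description relies on when it says cover saturates across like samples even for rare species. A threshold would then be applied to the composited value; the choice of threshold is left to the guidance in the article.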
format Online
Article
Text
id pubmed-9600373
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher American Society for Microbiology
record_format MEDLINE/PubMed
spelling pubmed-9600373 2022-10-27 Zebra: Static and Dynamic Genome Cover Thresholds with Overlapping References mSystems Observation American Society for Microbiology 2022-09-08 /pmc/articles/PMC9600373/ /pubmed/36073806 http://dx.doi.org/10.1128/msystems.00758-22 Text en Copyright © 2022 Hakim et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0/).
title Zebra: Static and Dynamic Genome Cover Thresholds with Overlapping References
topic Observation
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9600373/
https://www.ncbi.nlm.nih.gov/pubmed/36073806
http://dx.doi.org/10.1128/msystems.00758-22