Accurate and fast graph-based pangenome annotation and clustering with ggCaller


Bibliographic Details
Main Authors: Horsfield, Samuel T., Tonkin-Hill, Gerry, Croucher, Nicholas J., Lees, John A.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10620059/
https://www.ncbi.nlm.nih.gov/pubmed/37620118
http://dx.doi.org/10.1101/gr.277733.123
_version_ 1785130123313283072
author Horsfield, Samuel T.
Tonkin-Hill, Gerry
Croucher, Nicholas J.
Lees, John A.
author_facet Horsfield, Samuel T.
Tonkin-Hill, Gerry
Croucher, Nicholas J.
Lees, John A.
author_sort Horsfield, Samuel T.
collection PubMed
description Bacterial genomes differ in both gene content and sequence mutations, which underlie extensive phenotypic diversity, including variation in susceptibility to antimicrobials or vaccine-induced immunity. To identify and quantify important variants, all genes within a population must be predicted, functionally annotated, and clustered, representing the “pangenome.” Despite the volume of genome data available, gene prediction and annotation are currently conducted in isolation on individual genomes, which is computationally inefficient and frequently inconsistent across genomes. Here, we introduce the open-source software graph-gene-caller (ggCaller). ggCaller combines gene prediction, functional annotation, and clustering into a single workflow using population-wide de Bruijn graphs, removing redundancy in gene annotation and resulting in more accurate gene predictions and orthologue clustering. We applied ggCaller to simulated and real-world bacterial data sets containing hundreds or thousands of genomes, comparing it to current state-of-the-art tools. ggCaller has considerable speed-ups with equivalent or greater accuracy, particularly with data sets containing complex sources of error, such as assembly contamination or fragmentation. ggCaller is also an important extension to bacterial genome-wide association studies, enabling querying of annotated graphs for functional analyses. We highlight this application by functionally annotating DNA sequences with significant associations to tetracycline and macrolide resistance in Streptococcus pneumoniae, identifying key resistance determinants that were missed when using only a single reference genome. ggCaller is a novel bacterial genome analysis tool with applications in bacterial evolution and epidemiology.
format Online
Article
Text
id pubmed-10620059
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-106200592023-11-02 Accurate and fast graph-based pangenome annotation and clustering with ggCaller Horsfield, Samuel T. Tonkin-Hill, Gerry Croucher, Nicholas J. Lees, John A. Genome Res Methods Bacterial genomes differ in both gene content and sequence mutations, which underlie extensive phenotypic diversity, including variation in susceptibility to antimicrobials or vaccine-induced immunity. To identify and quantify important variants, all genes within a population must be predicted, functionally annotated, and clustered, representing the “pangenome.” Despite the volume of genome data available, gene prediction and annotation are currently conducted in isolation on individual genomes, which is computationally inefficient and frequently inconsistent across genomes. Here, we introduce the open-source software graph-gene-caller (ggCaller). ggCaller combines gene prediction, functional annotation, and clustering into a single workflow using population-wide de Bruijn graphs, removing redundancy in gene annotation and resulting in more accurate gene predictions and orthologue clustering. We applied ggCaller to simulated and real-world bacterial data sets containing hundreds or thousands of genomes, comparing it to current state-of-the-art tools. ggCaller has considerable speed-ups with equivalent or greater accuracy, particularly with data sets containing complex sources of error, such as assembly contamination or fragmentation. ggCaller is also an important extension to bacterial genome-wide association studies, enabling querying of annotated graphs for functional analyses. We highlight this application by functionally annotating DNA sequences with significant associations to tetracycline and macrolide resistance in Streptococcus pneumoniae, identifying key resistance determinants that were missed when using only a single reference genome. ggCaller is a novel bacterial genome analysis tool with applications in bacterial evolution and epidemiology. 
Cold Spring Harbor Laboratory Press 2023-09 /pmc/articles/PMC10620059/ /pubmed/37620118 http://dx.doi.org/10.1101/gr.277733.123 Text en © 2023 Horsfield et al.; Published by Cold Spring Harbor Laboratory Press https://creativecommons.org/licenses/by/4.0/ This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Methods
Horsfield, Samuel T.
Tonkin-Hill, Gerry
Croucher, Nicholas J.
Lees, John A.
Accurate and fast graph-based pangenome annotation and clustering with ggCaller
title Accurate and fast graph-based pangenome annotation and clustering with ggCaller
title_full Accurate and fast graph-based pangenome annotation and clustering with ggCaller
title_fullStr Accurate and fast graph-based pangenome annotation and clustering with ggCaller
title_full_unstemmed Accurate and fast graph-based pangenome annotation and clustering with ggCaller
title_short Accurate and fast graph-based pangenome annotation and clustering with ggCaller
title_sort accurate and fast graph-based pangenome annotation and clustering with ggcaller
topic Methods
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10620059/
https://www.ncbi.nlm.nih.gov/pubmed/37620118
http://dx.doi.org/10.1101/gr.277733.123
work_keys_str_mv AT horsfieldsamuelt accurateandfastgraphbasedpangenomeannotationandclusteringwithggcaller
AT tonkinhillgerry accurateandfastgraphbasedpangenomeannotationandclusteringwithggcaller
AT crouchernicholasj accurateandfastgraphbasedpangenomeannotationandclusteringwithggcaller
AT leesjohna accurateandfastgraphbasedpangenomeannotationandclusteringwithggcaller