Accurate and fast graph-based pangenome annotation and clustering with ggCaller
Bacterial genomes differ in both gene content and sequence mutations, which underlie extensive phenotypic diversity, including variation in susceptibility to antimicrobials or vaccine-induced immunity. To identify and quantify important variants, all genes within a population must be predicted, func...
Main Authors: | Horsfield, Samuel T.; Tonkin-Hill, Gerry; Croucher, Nicholas J.; Lees, John A. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory Press, 2023 |
Subjects: | Methods |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10620059/ https://www.ncbi.nlm.nih.gov/pubmed/37620118 http://dx.doi.org/10.1101/gr.277733.123 |
_version_ | 1785130123313283072 |
---|---|
author | Horsfield, Samuel T. Tonkin-Hill, Gerry Croucher, Nicholas J. Lees, John A. |
author_facet | Horsfield, Samuel T. Tonkin-Hill, Gerry Croucher, Nicholas J. Lees, John A. |
author_sort | Horsfield, Samuel T. |
collection | PubMed |
description | Bacterial genomes differ in both gene content and sequence mutations, which underlie extensive phenotypic diversity, including variation in susceptibility to antimicrobials or vaccine-induced immunity. To identify and quantify important variants, all genes within a population must be predicted, functionally annotated, and clustered, representing the “pangenome.” Despite the volume of genome data available, gene prediction and annotation are currently conducted in isolation on individual genomes, which is computationally inefficient and frequently inconsistent across genomes. Here, we introduce the open-source software graph-gene-caller (ggCaller). ggCaller combines gene prediction, functional annotation, and clustering into a single workflow using population-wide de Bruijn graphs, removing redundancy in gene annotation and resulting in more accurate gene predictions and orthologue clustering. We applied ggCaller to simulated and real-world bacterial data sets containing hundreds or thousands of genomes, comparing it to current state-of-the-art tools. ggCaller has considerable speed-ups with equivalent or greater accuracy, particularly with data sets containing complex sources of error, such as assembly contamination or fragmentation. ggCaller is also an important extension to bacterial genome-wide association studies, enabling querying of annotated graphs for functional analyses. We highlight this application by functionally annotating DNA sequences with significant associations to tetracycline and macrolide resistance in Streptococcus pneumoniae, identifying key resistance determinants that were missed when using only a single reference genome. ggCaller is a novel bacterial genome analysis tool with applications in bacterial evolution and epidemiology. |
format | Online Article Text |
id | pubmed-10620059 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10620059 2023-11-02 Accurate and fast graph-based pangenome annotation and clustering with ggCaller Horsfield, Samuel T. Tonkin-Hill, Gerry Croucher, Nicholas J. Lees, John A. Genome Res Methods Bacterial genomes differ in both gene content and sequence mutations, which underlie extensive phenotypic diversity, including variation in susceptibility to antimicrobials or vaccine-induced immunity. To identify and quantify important variants, all genes within a population must be predicted, functionally annotated, and clustered, representing the “pangenome.” Despite the volume of genome data available, gene prediction and annotation are currently conducted in isolation on individual genomes, which is computationally inefficient and frequently inconsistent across genomes. Here, we introduce the open-source software graph-gene-caller (ggCaller). ggCaller combines gene prediction, functional annotation, and clustering into a single workflow using population-wide de Bruijn graphs, removing redundancy in gene annotation and resulting in more accurate gene predictions and orthologue clustering. We applied ggCaller to simulated and real-world bacterial data sets containing hundreds or thousands of genomes, comparing it to current state-of-the-art tools. ggCaller has considerable speed-ups with equivalent or greater accuracy, particularly with data sets containing complex sources of error, such as assembly contamination or fragmentation. ggCaller is also an important extension to bacterial genome-wide association studies, enabling querying of annotated graphs for functional analyses. We highlight this application by functionally annotating DNA sequences with significant associations to tetracycline and macrolide resistance in Streptococcus pneumoniae, identifying key resistance determinants that were missed when using only a single reference genome. ggCaller is a novel bacterial genome analysis tool with applications in bacterial evolution and epidemiology.
Cold Spring Harbor Laboratory Press 2023-09 /pmc/articles/PMC10620059/ /pubmed/37620118 http://dx.doi.org/10.1101/gr.277733.123 Text en © 2023 Horsfield et al.; Published by Cold Spring Harbor Laboratory Press https://creativecommons.org/licenses/by/4.0/ This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Methods Horsfield, Samuel T. Tonkin-Hill, Gerry Croucher, Nicholas J. Lees, John A. Accurate and fast graph-based pangenome annotation and clustering with ggCaller |
title | Accurate and fast graph-based pangenome annotation and clustering with ggCaller |
title_full | Accurate and fast graph-based pangenome annotation and clustering with ggCaller |
title_fullStr | Accurate and fast graph-based pangenome annotation and clustering with ggCaller |
title_full_unstemmed | Accurate and fast graph-based pangenome annotation and clustering with ggCaller |
title_short | Accurate and fast graph-based pangenome annotation and clustering with ggCaller |
title_sort | accurate and fast graph-based pangenome annotation and clustering with ggcaller |
topic | Methods |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10620059/ https://www.ncbi.nlm.nih.gov/pubmed/37620118 http://dx.doi.org/10.1101/gr.277733.123 |
work_keys_str_mv | AT horsfieldsamuelt accurateandfastgraphbasedpangenomeannotationandclusteringwithggcaller AT tonkinhillgerry accurateandfastgraphbasedpangenomeannotationandclusteringwithggcaller AT crouchernicholasj accurateandfastgraphbasedpangenomeannotationandclusteringwithggcaller AT leesjohna accurateandfastgraphbasedpangenomeannotationandclusteringwithggcaller |