Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies
BACKGROUND: A pangenome is the collection of all genes found in a set of related genomes. For microbes, these genomes are often different strains of the same species, and the pangenome offers a means to compare gene content variation with differences in phenotypes, ecology, and phylogenetic relatedn...
Main Authors: | Shapiro, Jason W., Putonti, Catherine |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2021 |
Subjects: | Bioinformatics |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8351571/ https://www.ncbi.nlm.nih.gov/pubmed/34434663 http://dx.doi.org/10.7717/peerj.11950 |
_version_ | 1783736001531740160 |
---|---|
author | Shapiro, Jason W. Putonti, Catherine |
author_facet | Shapiro, Jason W. Putonti, Catherine |
author_sort | Shapiro, Jason W. |
collection | PubMed |
description | BACKGROUND: A pangenome is the collection of all genes found in a set of related genomes. For microbes, these genomes are often different strains of the same species, and the pangenome offers a means to compare gene content variation with differences in phenotypes, ecology, and phylogenetic relatedness. Though most frequently applied to bacteria, there is growing interest in adapting pangenome analysis to bacteriophages. However, working with phage genomes presents new challenges. First, most phage families are under-sampled, and homologous genes in related viruses can be difficult to identify. Second, homing endonucleases and intron-like sequences may be present, resulting in fragmented gene calls. Each of these issues can reduce the accuracy of standard pangenome analysis tools. METHODS: We developed an R pipeline called Rephine.r that takes as input the gene clusters produced by an initial pangenomics workflow. Rephine.r then proceeds in two primary steps. First, it identifies three common causes of fragmented gene calls: (1) indels creating early stop codons and new start codons; (2) interruption by a selfish genetic element; and (3) splitting at the ends of the reported genome. Fragmented genes are then fused to create new sequence alignments. In tandem, Rephine.r searches for distant homologs separated into different gene families using Hidden Markov Models. Significant hits are used to merge families into larger clusters. A final round of fragment identification is then run, and results may be used to infer single-copy core genomes and phylogenetic trees. RESULTS: We applied Rephine.r to three well-studied phage groups: the Tevenvirinae (e.g., T4), the Studiervirinae (e.g., T7), and the Pbunaviruses (e.g., PB1). In each case, Rephine.r recovered additional members of the single-copy core genome and increased the overall bootstrap support of the phylogeny. The Rephine.r pipeline is provided through GitHub (https://www.github.com/coevoeco/Rephine.r) as a single script for automated analysis and with utility functions to assist in building single-copy core genomes and predicting the sources of fragmented genes. |
format | Online Article Text |
id | pubmed-8351571 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-8351571 2021-08-24 Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies Shapiro, Jason W. Putonti, Catherine PeerJ Bioinformatics BACKGROUND: A pangenome is the collection of all genes found in a set of related genomes. For microbes, these genomes are often different strains of the same species, and the pangenome offers a means to compare gene content variation with differences in phenotypes, ecology, and phylogenetic relatedness. Though most frequently applied to bacteria, there is growing interest in adapting pangenome analysis to bacteriophages. However, working with phage genomes presents new challenges. First, most phage families are under-sampled, and homologous genes in related viruses can be difficult to identify. Second, homing endonucleases and intron-like sequences may be present, resulting in fragmented gene calls. Each of these issues can reduce the accuracy of standard pangenome analysis tools. METHODS: We developed an R pipeline called Rephine.r that takes as input the gene clusters produced by an initial pangenomics workflow. Rephine.r then proceeds in two primary steps. First, it identifies three common causes of fragmented gene calls: (1) indels creating early stop codons and new start codons; (2) interruption by a selfish genetic element; and (3) splitting at the ends of the reported genome. Fragmented genes are then fused to create new sequence alignments. In tandem, Rephine.r searches for distant homologs separated into different gene families using Hidden Markov Models. Significant hits are used to merge families into larger clusters. A final round of fragment identification is then run, and results may be used to infer single-copy core genomes and phylogenetic trees. RESULTS: We applied Rephine.r to three well-studied phage groups: the Tevenvirinae (e.g., T4), the Studiervirinae (e.g., T7), and the Pbunaviruses (e.g., PB1). In each case, Rephine.r recovered additional members of the single-copy core genome and increased the overall bootstrap support of the phylogeny. The Rephine.r pipeline is provided through GitHub (https://www.github.com/coevoeco/Rephine.r) as a single script for automated analysis and with utility functions to assist in building single-copy core genomes and predicting the sources of fragmented genes. PeerJ Inc. 2021-08-06 /pmc/articles/PMC8351571/ /pubmed/34434663 http://dx.doi.org/10.7717/peerj.11950 Text en ©2021 Shapiro and Putonti https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited. |
spellingShingle | Bioinformatics Shapiro, Jason W. Putonti, Catherine Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies |
title | Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies |
title_full | Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies |
title_fullStr | Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies |
title_full_unstemmed | Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies |
title_short | Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies |
title_sort | rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8351571/ https://www.ncbi.nlm.nih.gov/pubmed/34434663 http://dx.doi.org/10.7717/peerj.11950 |
work_keys_str_mv | AT shapirojasonw rephinerapipelineforcorrectinggenecallsandclusterstoimprovephagepangenomesandphylogenies AT putonticatherine rephinerapipelineforcorrectinggenecallsandclusterstoimprovephagepangenomesandphylogenies |
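
The description above mentions two steps that may be easier to follow with a small illustration: merging gene families that are linked by significant profile hits, and then picking out the single-copy core. The Python sketch below is not taken from Rephine.r (which is an R pipeline). The function names, the `max_evalue` cutoff of 1e-5, and the toy genome and gene labels are all hypothetical, and union-find is simply one convenient way to group linked families.

```python
from collections import Counter, defaultdict


def merge_families(families, hits, max_evalue=1e-5):
    """Merge gene families linked by significant hits (union-find).

    families: dict mapping family name -> list of (genome, gene) tuples.
    hits: iterable of (family_a, family_b, evalue) tuples.
    max_evalue: hypothetical significance cutoff, not a Rephine.r default.
    """
    parent = {name: name for name in families}

    def find(name):
        # Walk up to the representative family, halving the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for fam_a, fam_b, evalue in hits:
        if evalue <= max_evalue:
            root_a, root_b = find(fam_a), find(fam_b)
            if root_a != root_b:
                # Keep the alphabetically smaller name as the representative.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    merged = defaultdict(list)
    for name, genes in families.items():
        merged[find(name)].extend(genes)
    return dict(merged)


def single_copy_core(clusters, genomes):
    """Return sorted names of clusters with exactly one gene per genome."""
    core = []
    for name, genes in clusters.items():
        counts = Counter(genome for genome, _gene in genes)
        if all(counts[g] == 1 for g in genomes):
            core.append(name)
    return sorted(core)


# Toy input: labels are invented for illustration only.
families = {
    "fam1": [("phageA", "g1"), ("phageB", "g7")],
    "fam2": [("phageC", "g3")],
    "fam3": [("phageA", "g9"), ("phageC", "g5")],
}
hits = [("fam1", "fam2", 1e-20), ("fam2", "fam3", 0.5)]

clusters = merge_families(families, hits)
core = single_copy_core(clusters, ["phageA", "phageB", "phageC"])
print(clusters)
print(core)
```

In this toy input, `fam1` and `fam2` are joined by the significant hit, so the merged cluster covers all three genomes exactly once and becomes a core candidate; the weak hit between `fam2` and `fam3` is ignored, and `fam3` stays outside the core because it has no gene in `phageB`.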