BRAKER3: Fully Automated Genome Annotation Using RNA-Seq and Protein Evidence with GeneMark-ETP, AUGUSTUS and TSEBRA

Gene prediction remains an active area of bioinformatics research. Challenges are presented by large eukaryotic genomes and heterogeneous data situations. To meet the challenges, several streams of evidence must be integrated, from protein homology and transcriptome data, as well as information deri...

Bibliographic Details
Main Authors: Gabriel, Lars, Brůna, Tomáš, Hoff, Katharina J., Ebel, Matthis, Lomsadze, Alexandre, Borodovsky, Mark, Stanke, Mario
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10312602/
https://www.ncbi.nlm.nih.gov/pubmed/37398387
http://dx.doi.org/10.1101/2023.06.10.544449
author Gabriel, Lars
Brůna, Tomáš
Hoff, Katharina J.
Ebel, Matthis
Lomsadze, Alexandre
Borodovsky, Mark
Stanke, Mario
collection PubMed
description Gene prediction remains an active area of bioinformatics research. Challenges are presented by large eukaryotic genomes and heterogeneous data situations. To meet the challenges, several streams of evidence must be integrated, from protein homology and transcriptome data, as well as information derived from the genome itself. The amount and significance of the available evidence from transcriptomes and proteomes vary from genome to genome, between genes and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-Seq or protein data, respectively, but not both. The recently released GeneMark-ETP integrates all three types of data and achieves much higher levels of accuracy. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-Seq and a large protein database along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on 11 species under controlled conditions on the assumed relatedness of the target species to available proteomes. BRAKER3 outperformed BRAKER1 and BRAKER2, increasing the average transcript-level F1-score by ~20 percentage points, most pronounced for species with large and complex genomes. BRAKER3 also outperforms MAKER2 and Funannotate. For the first time, we provide a Singularity container for the BRAKER software to minimize installation obstacles. Overall, BRAKER3 is an accurate, easy-to-use tool for the annotation of eukaryotic genomes.
format Online
Article
Text
id pubmed-10312602
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10312602 2023-07-01 bioRxiv Article Cold Spring Harbor Laboratory 2023-09-02 /pmc/articles/PMC10312602/ /pubmed/37398387 http://dx.doi.org/10.1101/2023.06.10.544449 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title BRAKER3: Fully Automated Genome Annotation Using RNA-Seq and Protein Evidence with GeneMark-ETP, AUGUSTUS and TSEBRA
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10312602/
https://www.ncbi.nlm.nih.gov/pubmed/37398387
http://dx.doi.org/10.1101/2023.06.10.544449