No one tool to rule them all: prokaryotic gene prediction tool annotations are highly dependent on the organism of study

MOTIVATION: The biases in CoDing Sequence (CDS) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information as it results in predictions being biased tow...

Bibliographic Details
Main authors: Dimonaco, Nicholas J; Aubrey, Wayne; Kenobi, Kim; Clare, Amanda; Creevey, Christopher J
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects: Original Papers
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8825762/
https://www.ncbi.nlm.nih.gov/pubmed/34875010
http://dx.doi.org/10.1093/bioinformatics/btab827
author Dimonaco, Nicholas J
Aubrey, Wayne
Kenobi, Kim
Clare, Amanda
Creevey, Christopher J
collection PubMed
description MOTIVATION: The biases in CoDing Sequence (CDS) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information as it results in predictions being biased towards existing knowledge. To date, users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any CDS prediction tool and allow them to choose the right tool for their analysis. RESULTS: We present an evaluation framework (ORForise) based on a comprehensive set of 12 primary and 60 secondary metrics that facilitate the assessment of the performance of CDS prediction tools. This makes it possible to identify which performs better for specific use-cases. We use this to assess 15 ab initio- and model-based tools representing those most widely used (historically and currently) to generate the knowledge in genomic databases. We find that the performance of any tool is dependent on the genome being analysed, and no individual tool ranked as the most accurate across all genomes or metrics analysed. Even the top-ranked tools produced conflicting gene collections, which could not be resolved by aggregation. The ORForise evaluation framework provides users with a replicable, data-led approach to make informed tool choices for novel genome annotations and for refining historical annotations. AVAILABILITY AND IMPLEMENTATION: Code and datasets for reproduction and customisation are available at https://github.com/NickJD/ORForise. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-8825762
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8825762 2022-02-09 Bioinformatics Original Papers
Oxford University Press 2021-12-07 /pmc/articles/PMC8825762/ /pubmed/34875010 http://dx.doi.org/10.1093/bioinformatics/btab827 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title No one tool to rule them all: prokaryotic gene prediction tool annotations are highly dependent on the organism of study
topic Original Papers