A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs

Identification of protein-coding and non-coding transcripts is paramount for understanding their biological roles. Computational approaches have been addressing this task for over a decade; however, generalized and high-performance models are still unreliable. This benchmark study assessed the perfo...

Bibliographic Details
Main authors: Singh, Dalwinder; Roy, Joy
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects: Computational Biology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9757047/
https://www.ncbi.nlm.nih.gov/pubmed/36420898
http://dx.doi.org/10.1093/nar/gkac1092
_version_ 1784851747471097856
author Singh, Dalwinder
Roy, Joy
author_facet Singh, Dalwinder
Roy, Joy
author_sort Singh, Dalwinder
collection PubMed
description Identification of protein-coding and non-coding transcripts is paramount for understanding their biological roles. Computational approaches have been addressing this task for over a decade; however, generalized and high-performance models are still unreliable. This benchmark study assessed the performance of 24 tools producing >55 models on the datasets covering a wide range of species. We have collected 135 small and large transcriptomic datasets from existing studies for comparison and identified the potential bottlenecks hampering the performance of current tools. The key insights of this study include lack of standardized training sets, reliance on homogeneous training data, gradual changes in annotated data, lack of augmentation with homology searches, the presence of false positives and negatives in datasets and the lower performance of end-to-end deep learning models. We also derived a new dataset, RNAChallenge, from the benchmark considering hard instances that may include potential false alarms. The best and least well performing models under- and overfit the dataset, respectively, thereby serving a dual purpose. For computational approaches, it will be valuable to develop accurate and unbiased models. The identification of false alarms will be of interest for genome annotators, and experimental study of hard RNAs will help to untangle the complexity of the RNA world.
format Online
Article
Text
id pubmed-9757047
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9757047 2022-12-19 A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs Singh, Dalwinder Roy, Joy Nucleic Acids Res Computational Biology Identification of protein-coding and non-coding transcripts is paramount for understanding their biological roles. Computational approaches have been addressing this task for over a decade; however, generalized and high-performance models are still unreliable. This benchmark study assessed the performance of 24 tools producing >55 models on the datasets covering a wide range of species. We have collected 135 small and large transcriptomic datasets from existing studies for comparison and identified the potential bottlenecks hampering the performance of current tools. The key insights of this study include lack of standardized training sets, reliance on homogeneous training data, gradual changes in annotated data, lack of augmentation with homology searches, the presence of false positives and negatives in datasets and the lower performance of end-to-end deep learning models. We also derived a new dataset, RNAChallenge, from the benchmark considering hard instances that may include potential false alarms. The best and least well performing models under- and overfit the dataset, respectively, thereby serving a dual purpose. For computational approaches, it will be valuable to develop accurate and unbiased models. The identification of false alarms will be of interest for genome annotators, and experimental study of hard RNAs will help to untangle the complexity of the RNA world. Oxford University Press 2022-11-24 /pmc/articles/PMC9757047/ /pubmed/36420898 http://dx.doi.org/10.1093/nar/gkac1092 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of Nucleic Acids Research.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Computational Biology
Singh, Dalwinder
Roy, Joy
A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs
title A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs
topic Computational Biology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9757047/
https://www.ncbi.nlm.nih.gov/pubmed/36420898
http://dx.doi.org/10.1093/nar/gkac1092