A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs
Identification of protein-coding and non-coding transcripts is paramount for understanding their biological roles. Computational approaches have been addressing this task for over a decade; however, generalized and high-performance models are still unreliable. This benchmark study assessed the perfo...
Main Authors: | Singh, Dalwinder; Roy, Joy |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2022 |
Subjects: | Computational Biology |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9757047/ https://www.ncbi.nlm.nih.gov/pubmed/36420898 http://dx.doi.org/10.1093/nar/gkac1092 |
_version_ | 1784851747471097856 |
---|---|
author | Singh, Dalwinder Roy, Joy |
author_facet | Singh, Dalwinder Roy, Joy |
author_sort | Singh, Dalwinder |
collection | PubMed |
description | Identification of protein-coding and non-coding transcripts is paramount for understanding their biological roles. Computational approaches have been addressing this task for over a decade; however, generalized and high-performance models are still unreliable. This benchmark study assessed the performance of 24 tools producing >55 models on the datasets covering a wide range of species. We have collected 135 small and large transcriptomic datasets from existing studies for comparison and identified the potential bottlenecks hampering the performance of current tools. The key insights of this study include lack of standardized training sets, reliance on homogeneous training data, gradual changes in annotated data, lack of augmentation with homology searches, the presence of false positives and negatives in datasets and the lower performance of end-to-end deep learning models. We also derived a new dataset, RNAChallenge, from the benchmark considering hard instances that may include potential false alarms. The best and least well performing models under- and overfit the dataset, respectively, thereby serving a dual purpose. For computational approaches, it will be valuable to develop accurate and unbiased models. The identification of false alarms will be of interest for genome annotators, and experimental study of hard RNAs will help to untangle the complexity of the RNA world. |
format | Online Article Text |
id | pubmed-9757047 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-97570472022-12-19 A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs Singh, Dalwinder Roy, Joy Nucleic Acids Res Computational Biology Identification of protein-coding and non-coding transcripts is paramount for understanding their biological roles. Computational approaches have been addressing this task for over a decade; however, generalized and high-performance models are still unreliable. This benchmark study assessed the performance of 24 tools producing >55 models on the datasets covering a wide range of species. We have collected 135 small and large transcriptomic datasets from existing studies for comparison and identified the potential bottlenecks hampering the performance of current tools. The key insights of this study include lack of standardized training sets, reliance on homogeneous training data, gradual changes in annotated data, lack of augmentation with homology searches, the presence of false positives and negatives in datasets and the lower performance of end-to-end deep learning models. We also derived a new dataset, RNAChallenge, from the benchmark considering hard instances that may include potential false alarms. The best and least well performing models under- and overfit the dataset, respectively, thereby serving a dual purpose. For computational approaches, it will be valuable to develop accurate and unbiased models. The identification of false alarms will be of interest for genome annotators, and experimental study of hard RNAs will help to untangle the complexity of the RNA world. Oxford University Press 2022-11-24 /pmc/articles/PMC9757047/ /pubmed/36420898 http://dx.doi.org/10.1093/nar/gkac1092 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of Nucleic Acids Research. 
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Computational Biology Singh, Dalwinder Roy, Joy A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs |
title | A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs |
title_full | A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs |
title_fullStr | A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs |
title_full_unstemmed | A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs |
title_short | A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs |
title_sort | large-scale benchmark study of tools for the classification of protein-coding and non-coding rnas |
topic | Computational Biology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9757047/ https://www.ncbi.nlm.nih.gov/pubmed/36420898 http://dx.doi.org/10.1093/nar/gkac1092 |
work_keys_str_mv | AT singhdalwinder alargescalebenchmarkstudyoftoolsfortheclassificationofproteincodingandnoncodingrnas AT royjoy alargescalebenchmarkstudyoftoolsfortheclassificationofproteincodingandnoncodingrnas AT singhdalwinder largescalebenchmarkstudyoftoolsfortheclassificationofproteincodingandnoncodingrnas AT royjoy largescalebenchmarkstudyoftoolsfortheclassificationofproteincodingandnoncodingrnas |