A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs

Identification of protein-coding and non-coding transcripts is paramount for understanding their biological roles. Computational approaches have been addressing this task for over a decade; however, generalized and high-performance models are still unreliable. This benchmark study assessed the perfo...

Bibliographic Details
Main Authors:	Singh, Dalwinder; Roy, Joy
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Computational Biology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9757047/ https://www.ncbi.nlm.nih.gov/pubmed/36420898 http://dx.doi.org/10.1093/nar/gkac1092

_version_	1784851747471097856
author	Singh, Dalwinder; Roy, Joy
author_facet	Singh, Dalwinder; Roy, Joy
author_sort	Singh, Dalwinder
collection	PubMed
description	Identification of protein-coding and non-coding transcripts is paramount for understanding their biological roles. Computational approaches have been addressing this task for over a decade; however, generalized and high-performance models are still unreliable. This benchmark study assessed the performance of 24 tools producing >55 models on the datasets covering a wide range of species. We have collected 135 small and large transcriptomic datasets from existing studies for comparison and identified the potential bottlenecks hampering the performance of current tools. The key insights of this study include lack of standardized training sets, reliance on homogeneous training data, gradual changes in annotated data, lack of augmentation with homology searches, the presence of false positives and negatives in datasets and the lower performance of end-to-end deep learning models. We also derived a new dataset, RNAChallenge, from the benchmark considering hard instances that may include potential false alarms. The best and least well performing models under- and overfit the dataset, respectively, thereby serving a dual purpose. For computational approaches, it will be valuable to develop accurate and unbiased models. The identification of false alarms will be of interest for genome annotators, and experimental study of hard RNAs will help to untangle the complexity of the RNA world.
format	Online Article Text
id	pubmed-9757047
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9757047 2022-12-19 Singh, Dalwinder; Roy, Joy. Nucleic Acids Res. Computational Biology. Oxford University Press 2022-11-24 /pmc/articles/PMC9757047/ /pubmed/36420898 http://dx.doi.org/10.1093/nar/gkac1092 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	A large-scale benchmark study of tools for the classification of protein-coding and non-coding RNAs
topic	Computational Biology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9757047/ https://www.ncbi.nlm.nih.gov/pubmed/36420898 http://dx.doi.org/10.1093/nar/gkac1092
