
PseU-ST: A new stacked ensemble-learning method for identifying RNA pseudouridine sites

Background: Pseudouridine (Ψ) is one of the most abundant RNA modifications found in a variety of RNA types, and it plays a significant role in many biological processes. The key to studying the various biochemical functions and mechanisms of Ψ is to identify the Ψ sites. However, identifying Ψ site...


Bibliographic Details
Main Authors: Zhang, Xinru; Wang, Shutao; Xie, Lina; Zhu, Yuhui
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9892456/
https://www.ncbi.nlm.nih.gov/pubmed/36741328
http://dx.doi.org/10.3389/fgene.2023.1121694
author Zhang, Xinru
Wang, Shutao
Xie, Lina
Zhu, Yuhui
author_sort Zhang, Xinru
collection PubMed
description Background: Pseudouridine (Ψ) is one of the most abundant RNA modifications, occurs in a variety of RNA types, and plays a significant role in many biological processes. Identifying Ψ sites is the key to studying the biochemical functions and mechanisms of Ψ. However, identifying Ψ sites experimentally is time-consuming and expensive, so computational methods that accurately predict Ψ sites from RNA sequence information are needed.
Methods: In this study, we propose a new model, PseU-ST, to identify Ψ sites in Homo sapiens (H. sapiens), Saccharomyces cerevisiae (S. cerevisiae), and Mus musculus (M. musculus). We selected the six best encoding schemes and four machine learning algorithms after a comprehensive test of almost all of the RNA sequence encoding schemes available in the iLearnPlus software package, and selected the optimal features for each encoding scheme using the chi-square and incremental feature selection algorithms. We then selected the optimal feature combination and the best base-classifier combination for each species through an extensive performance comparison and used a stacking strategy to build the predictive model.
Results: PseU-ST achieved better prediction performance than existing models. Its accuracy scores were 93.64%, 87.74%, and 89.64% on H_990, S_628, and M_944, respectively, which are 13.94%, 6.05%, and 0.26% higher than those of the best existing methods on the same benchmark training datasets.
Conclusion: The data indicate that PseU-ST is a very competitive prediction model for identifying RNA Ψ sites in H. sapiens, M. musculus, and S. cerevisiae. In addition, we found that the position-specific trinucleotide propensity based on single strand (PSTNPss) and position-specific of three nucleotides (PS3) features play an important role in Ψ site identification.
The source code for PseU-ST and the data are available in our GitHub repository (https://github.com/jluzhangxinrubio/PseU-ST).
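The abstract names PS3 features and chi-square feature ranking without showing either, so a minimal sketch may help. It is not the authors' code and does not use iLearnPlus. It assumes that PS3 one-hot encodes every overlapping trinucleotide (64 values per position), and it scores each binary feature with the standard Pearson chi-square statistic for a 2x2 table. The function names (`ps3_encode`, `chi_square_2x2`, `rank_features`) and the toy sequences are hypothetical. Incremental feature selection would then add features in this ranked order and keep the subset with the best cross-validated score; that step and the stacking model are not shown.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
# All 64 trinucleotides in a fixed order: AAA, AAC, ..., UUU.
TRINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]
TRI_INDEX = {tri: i for i, tri in enumerate(TRINUCLEOTIDES)}


def ps3_encode(seq):
    """One-hot encode every overlapping trinucleotide of an RNA sequence.

    Returns a flat list of 64 * (len(seq) - 2) zeros and ones. Windows that
    contain a character outside A/C/G/U are left as all zeros (an assumption
    of this sketch).
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input too
    features = []
    for start in range(len(seq) - 2):
        block = [0] * len(TRINUCLEOTIDES)
        index = TRI_INDEX.get(seq[start:start + 3])
        if index is not None:
            block[index] = 1
        features.extend(block)
    return features


def chi_square_2x2(column, labels):
    """Pearson chi-square statistic for one binary feature vs. a binary label."""
    a = sum(1 for x, y in zip(column, labels) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(column, labels) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(column, labels) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(column, labels) if x == 0 and y == 0)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:  # constant feature or single-class labels
        return 0.0
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / denominator


def rank_features(matrix, labels):
    """Return feature indices sorted from highest to lowest chi-square score."""
    scores = [chi_square_2x2(col, labels) for col in zip(*matrix)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data, made up for illustration: both positives start with "GUA".
sequences = ["GUACC", "GUAGG", "CCAGG", "ACUGC"]
labels = [1, 1, 0, 0]
matrix = [ps3_encode(s) for s in sequences]
ranking = rank_features(matrix, labels)
```

In this toy set the top-ranked feature is the one that marks "GUA" at the first position, because it is the only feature that separates the two classes perfectly. Real Ψ-site datasets use longer, fixed-length sequence windows, so the feature vector is much larger and the ranking is far less clear-cut.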
format Online
Article
Text
id pubmed-9892456
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9892456 2023-02-03 PseU-ST: A new stacked ensemble-learning method for identifying RNA pseudouridine sites Zhang, Xinru Wang, Shutao Xie, Lina Zhu, Yuhui Front Genet Genetics Frontiers Media S.A. 2023-01-19 /pmc/articles/PMC9892456/ /pubmed/36741328 http://dx.doi.org/10.3389/fgene.2023.1121694 Text en
Copyright © 2023 Zhang, Wang, Xie and Zhu. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title PseU-ST: A new stacked ensemble-learning method for identifying RNA pseudouridine sites
topic Genetics