PseU-ST: A new stacked ensemble-learning method for identifying RNA pseudouridine sites
Background: Pseudouridine (Ψ) is one of the most abundant RNA modifications found in a variety of RNA types, and it plays a significant role in many biological processes. The key to studying the various biochemical functions and mechanisms of Ψ is to identify the Ψ sites. However, identifying Ψ site...
Main authors: | Zhang, Xinru; Wang, Shutao; Xie, Lina; Zhu, Yuhui |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2023 |
Subjects: | Genetics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9892456/ https://www.ncbi.nlm.nih.gov/pubmed/36741328 http://dx.doi.org/10.3389/fgene.2023.1121694 |
_version_ | 1784881327408939008 |
---|---|
author | Zhang, Xinru; Wang, Shutao; Xie, Lina; Zhu, Yuhui |
author_facet | Zhang, Xinru; Wang, Shutao; Xie, Lina; Zhu, Yuhui |
author_sort | Zhang, Xinru |
collection | PubMed |
description | Background: Pseudouridine (Ψ) is one of the most abundant RNA modifications found in a variety of RNA types, and it plays a significant role in many biological processes. The key to studying the various biochemical functions and mechanisms of Ψ is to identify the Ψ sites. However, identifying Ψ sites using experimental methods is time-consuming and expensive. Therefore, it is necessary to develop computational methods that can accurately predict Ψ sites based on RNA sequence information. Methods: In this study, we proposed a new model called PseU-ST to identify Ψ sites in Homo sapiens (H. sapiens), Saccharomyces cerevisiae (S. cerevisiae), and Mus musculus (M. musculus). We selected the best six encoding schemes and four machine learning algorithms based on a comprehensive test of almost all of the RNA sequence encoding schemes available in the iLearnPlus software package, and selected the optimal features for each encoding scheme using chi-square and incremental feature selection algorithms. Then, we selected the optimal feature combination and the best base-classifier combination for each species through an extensive performance comparison and employed a stacking strategy to build the predictive model. Results: The results demonstrated that PseU-ST achieved better prediction performance compared with other existing models. The PseU-ST accuracy scores were 93.64%, 87.74%, and 89.64% on H_990, S_628, and M_944, respectively, representing increments of 13.94%, 6.05%, and 0.26%, respectively, higher than the best existing methods on the same benchmark training datasets. Conclusion: The data indicate that PseU-ST is a very competitive prediction model for identifying RNA Ψ sites in H. sapiens, M. musculus, and S. cerevisiae. In addition, we found that the Position-specific trinucleotide propensity based on single strand (PSTNPss) and Position-specific of three nucleotides (PS3) features play an important role in Ψ site identification. The source code for PseU-ST and the data are obtainable in our GitHub repository (https://github.com/jluzhangxinrubio/PseU-ST). |
format | Online Article Text |
id | pubmed-9892456 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9892456 2023-02-03 PseU-ST: A new stacked ensemble-learning method for identifying RNA pseudouridine sites Zhang, Xinru Wang, Shutao Xie, Lina Zhu, Yuhui Front Genet Genetics Background: Pseudouridine (Ψ) is one of the most abundant RNA modifications found in a variety of RNA types, and it plays a significant role in many biological processes. The key to studying the various biochemical functions and mechanisms of Ψ is to identify the Ψ sites. However, identifying Ψ sites using experimental methods is time-consuming and expensive. Therefore, it is necessary to develop computational methods that can accurately predict Ψ sites based on RNA sequence information. Methods: In this study, we proposed a new model called PseU-ST to identify Ψ sites in Homo sapiens (H. sapiens), Saccharomyces cerevisiae (S. cerevisiae), and Mus musculus (M. musculus). We selected the best six encoding schemes and four machine learning algorithms based on a comprehensive test of almost all of the RNA sequence encoding schemes available in the iLearnPlus software package, and selected the optimal features for each encoding scheme using chi-square and incremental feature selection algorithms. Then, we selected the optimal feature combination and the best base-classifier combination for each species through an extensive performance comparison and employed a stacking strategy to build the predictive model. Results: The results demonstrated that PseU-ST achieved better prediction performance compared with other existing models. The PseU-ST accuracy scores were 93.64%, 87.74%, and 89.64% on H_990, S_628, and M_944, respectively, representing increments of 13.94%, 6.05%, and 0.26%, respectively, higher than the best existing methods on the same benchmark training datasets. Conclusion: The data indicate that PseU-ST is a very competitive prediction model for identifying RNA Ψ sites in H. sapiens, M. musculus, and S. cerevisiae. In addition, we found that the Position-specific trinucleotide propensity based on single strand (PSTNPss) and Position-specific of three nucleotides (PS3) features play an important role in Ψ site identification. The source code for PseU-ST and the data are obtainable in our GitHub repository (https://github.com/jluzhangxinrubio/PseU-ST). Frontiers Media S.A. 2023-01-19 /pmc/articles/PMC9892456/ /pubmed/36741328 http://dx.doi.org/10.3389/fgene.2023.1121694 Text en Copyright © 2023 Zhang, Wang, Xie and Zhu. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Genetics Zhang, Xinru Wang, Shutao Xie, Lina Zhu, Yuhui PseU-ST: A new stacked ensemble-learning method for identifying RNA pseudouridine sites |
title | PseU-ST: A new stacked ensemble-learning method for identifying RNA pseudouridine sites |
title_full | PseU-ST: A new stacked ensemble-learning method for identifying RNA pseudouridine sites |
title_fullStr | PseU-ST: A new stacked ensemble-learning method for identifying RNA pseudouridine sites |
title_full_unstemmed | PseU-ST: A new stacked ensemble-learning method for identifying RNA pseudouridine sites |
title_short | PseU-ST: A new stacked ensemble-learning method for identifying RNA pseudouridine sites |
title_sort | pseu-st: a new stacked ensemble-learning method for identifying rna pseudouridine sites |
topic | Genetics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9892456/ https://www.ncbi.nlm.nih.gov/pubmed/36741328 http://dx.doi.org/10.3389/fgene.2023.1121694 |
work_keys_str_mv | AT zhangxinru pseustanewstackedensemblelearningmethodforidentifyingrnapseudouridinesites AT wangshutao pseustanewstackedensemblelearningmethodforidentifyingrnapseudouridinesites AT xielina pseustanewstackedensemblelearningmethodforidentifyingrnapseudouridinesites AT zhuyuhui pseustanewstackedensemblelearningmethodforidentifyingrnapseudouridinesites |
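
The feature-selection step named in the description (chi-square ranking followed by incremental feature selection) can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy sequences, the one-hot encoding, and the nearest-centroid classifier with leave-one-out scoring are all hypothetical stand-ins, and the PSTNPss and PS3 encodings, iLearnPlus, and the stacking stage are not reproduced here.

```python
from collections import Counter

ALPHABET = "ACGU"

def one_hot(seq):
    """Encode an RNA string as 4 binary features per position (A, C, G, U)."""
    return [1 if base == ch else 0 for base in seq for ch in ALPHABET]

def chi_square(feature, labels):
    """Chi-square statistic of the 2x2 table: binary feature vs. binary label."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    stat = 0.0
    for f in (0, 1):
        for y in (0, 1):
            row = observed[(f, 0)] + observed[(f, 1)]
            col = observed[(0, y)] + observed[(1, y)]
            expected = row * col / n
            if expected > 0:  # constant features contribute nothing
                stat += (observed[(f, y)] - expected) ** 2 / expected
    return stat

def rank_features(X, y):
    """Return feature indices by decreasing chi-square score, plus the scores."""
    scores = [chi_square([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return ranked, scores

def loo_accuracy(X, y, cols):
    """Leave-one-out accuracy of a nearest-centroid classifier on `cols`.

    The classifier is an assumption chosen for brevity; any model could be
    substituted at this point.
    """
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
        pred = min(
            centroids,
            key=lambda lab: sum(
                (X[i][c] - m) ** 2 for c, m in zip(cols, centroids[lab])
            ),
        )
        correct += pred == y[i]
    return correct / len(X)

def incremental_selection(X, y, ranked):
    """Add features in rank order; keep the shortest prefix with the best score."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked) + 1):
        acc = loo_accuracy(X, y, ranked[:k])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return ranked[:best_k], best_acc

# Hypothetical toy data: 5-nt windows centred on U; positives start with G.
positives = ["GAUCA", "GCUAC", "GUUGA", "GGUCU"]
negatives = ["AAUCA", "CCUAC", "UUUGA", "ACUCU"]
X = [one_hot(s) for s in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)

ranked, scores = rank_features(X, y)
selected, best_acc = incremental_selection(X, y, ranked)
print("top feature:", ranked[0], "score:", scores[ranked[0]])
print("selected features:", selected, "LOO accuracy:", best_acc)
```

On this toy set the only perfectly discriminative feature is "G at position 0" (feature index 2), so the incremental search stops improving after the first feature. With real data the accuracy curve over the ranked prefix is usually less clear-cut, and the choice of classifier and validation scheme matters.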