LncCat: An ORF attention model to identify LncRNA based on ensemble learning strategy and fused sequence information
BACKGROUND: Long non-coding RNAs (lncRNAs) are among the most important classes of transcripts; although they lack protein-coding ability, they play crucial regulatory roles in the development of cancers and other diseases. Short ORFs (sORFs) in lncRNAs were long assumed to have little capacity to be translated into proteins. However, recent re...
Main Authors: | Feng, Hongqi; Wang, Shaocong; Wang, Yan; Ni, Xinye; Yang, Zexi; Hu, Xuemei; Sen Yang |
---|---|
Format: | Online Article Text |
Language: | English |
Published: |
Research Network of Computational and Structural Biotechnology
2023
|
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9941877/ https://www.ncbi.nlm.nih.gov/pubmed/36824229 http://dx.doi.org/10.1016/j.csbj.2023.02.012 |
_version_ | 1784891379643580416 |
---|---|
author | Feng, Hongqi; Wang, Shaocong; Wang, Yan; Ni, Xinye; Yang, Zexi; Hu, Xuemei; Sen Yang |
author_sort | Feng, Hongqi |
collection | PubMed |
description | BACKGROUND: Long non-coding RNAs (lncRNAs) are among the most important classes of transcripts; although they lack protein-coding ability, they play crucial regulatory roles in the development of cancers and other diseases. Short ORFs (sORFs) in lncRNAs were long assumed to have little capacity to be translated into proteins. However, recent research has shown that sORFs can encode peptides, which makes lncRNAs harder to identify. Identifying lncRNAs that contain sORFs therefore facilitates the discovery of novel regulatory factors. RESULTS: In this paper, we propose LncCat, a method for identifying lncRNAs based on category boosting (CatBoost) and ORF-attention features. LncCat combines five types of features to encode transcript sequences and employs CatBoost to build a prediction model. A visualization comparison reveals that the ORF-attention features of lncRNAs and protein-coding transcripts are significantly distinct. Comparison results show that LncCat outperforms competing methods on several benchmark datasets. In terms of the Matthews correlation coefficient (MCC), LncCat achieves 0.9503, 0.9219, 0.8591, 0.8672, and 0.9047 on the human, mouse, zebrafish, wheat, and chicken datasets, with improvements of 1.90–7.82%, 1.49–17.63%, 6.11–21.50%, 3.02–51.64%, and 5.35–26.90%, respectively. Moreover, LncCat improves the MCC by at least 11.90%, 12.96%, and 42.61% on the sORF test datasets of human, mouse, and zebrafish, respectively. CONCLUSIONS: Experiments indicate that LncCat performs better on both long-ORF and sORF datasets, and that ORF-attention features have a positive effect on lncRNA prediction. In brief, LncCat is a reliable method for identifying lncRNAs. Additionally, a user-friendly web server for academic users is available at http://cczubio.top/lnccat. |
format | Online Article Text |
id | pubmed-9941877 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Research Network of Computational and Structural Biotechnology |
record_format | MEDLINE/PubMed |
spelling | pubmed-9941877 2023-02-22 LncCat: An ORF attention model to identify LncRNA based on ensemble learning strategy and fused sequence information Feng, Hongqi; Wang, Shaocong; Wang, Yan; Ni, Xinye; Yang, Zexi; Hu, Xuemei; Sen Yang Comput Struct Biotechnol J Software/Web Server Article Research Network of Computational and Structural Biotechnology 2023-02-08 /pmc/articles/PMC9941877/ /pubmed/36824229 http://dx.doi.org/10.1016/j.csbj.2023.02.012 Text en © 2023 The Author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
title | LncCat: An ORF attention model to identify LncRNA based on ensemble learning strategy and fused sequence information |
topic | Software/Web Server Article |