LncCat: An ORF attention model to identify LncRNA based on ensemble learning strategy and fused sequence information

Bibliographic Details
Main authors: Feng, Hongqi; Wang, Shaocong; Wang, Yan; Ni, Xinye; Yang, Zexi; Hu, Xuemei; Yang, Sen
Format: Online Article Text
Language: English
Published: Research Network of Computational and Structural Biotechnology 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9941877/
https://www.ncbi.nlm.nih.gov/pubmed/36824229
http://dx.doi.org/10.1016/j.csbj.2023.02.012
author Feng, Hongqi
Wang, Shaocong
Wang, Yan
Ni, Xinye
Yang, Zexi
Hu, Xuemei
Yang, Sen
collection PubMed
description BACKGROUND: Long non-coding RNAs (lncRNAs) are among the most essential types of transcripts; although they lack protein-coding ability, they play crucial regulatory roles in the development of cancers and other diseases. Short ORFs (sORFs) in lncRNAs were long assumed to have little capacity to be translated into proteins. However, recent research has shown that sORFs can encode peptides, which makes lncRNAs harder to identify. Identifying lncRNAs that contain sORFs therefore helps in finding novel regulatory factors. RESULTS: In this paper, we propose LncCat, which identifies lncRNAs using category boosting (CatBoost) and ORF-attention features. LncCat combines five types of features to encode transcript sequences and employs CatBoost to build a prediction model. A visualization comparison reveals that the ORF-attention features of lncRNAs and protein-coding transcripts are significantly distinct. Comparison results show that LncCat outperforms competing methods on several benchmark datasets. For the Matthews correlation coefficient (MCC), LncCat achieves 0.9503, 0.9219, 0.8591, 0.8672, and 0.9047 on the human, mouse, zebrafish, wheat, and chicken datasets, with improvements of 1.90–7.82%, 1.49–17.63%, 6.11–21.50%, 3.02–51.64%, and 5.35–26.90%, respectively. Moreover, LncCat substantially improves the MCC, by at least 11.90%, 12.96%, and 42.61% on the sORF test datasets of human, mouse, and zebrafish, respectively. CONCLUSIONS: Experiments indicate that LncCat performs better on both long-ORF and sORF datasets, and that ORF-attention features have a positive effect on lncRNA prediction. In brief, LncCat is a reliable method for identifying lncRNAs. Additionally, a user-friendly web server for academic use is available at http://cczubio.top/lnccat.
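The abstract relies on two concepts that a short example may make concrete: the open reading frame (ORF) of a transcript and the MCC used to score predictions. The Python sketch below is illustrative only and is not the LncCat implementation. It assumes an ORF runs from an ATG start codon to the first in-frame stop codon and scans only the three forward frames; the record does not state how LncCat defines ORFs or builds its ORF-attention features. The function names `longest_orf` and `mcc` are hypothetical.

```python
import math

# Standard stop codons in DNA notation (U is mapped to T below).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Return the longest ATG...stop ORF (stop codon included).

    Assumption: only the three forward reading frames are scanned and
    an ORF must end with an in-frame stop codon. Returns "" if none.
    """
    seq = seq.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # close this ORF and look for the next one
    return best


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    print(longest_orf("CCATGAAATGATT"))  # ATGAAATGA
    print(mcc(45, 45, 5, 5))             # 0.8
```

The MCC ranges from -1 to 1, so a reported value such as 0.9503 indicates close agreement between predicted and true labels. The ORF length (here `len(longest_orf(seq))`) is one simple sequence-derived quantity; which lengths count as sORFs is a threshold choice that this record does not specify.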
format Online
Article
Text
id pubmed-9941877
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Research Network of Computational and Structural Biotechnology
record_format MEDLINE/PubMed
spelling pubmed-9941877 2023-02-22 Comput Struct Biotechnol J Software/Web Server Article Research Network of Computational and Structural Biotechnology 2023-02-08 /pmc/articles/PMC9941877/ /pubmed/36824229 http://dx.doi.org/10.1016/j.csbj.2023.02.012 Text en © 2023 The Author(s). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title LncCat: An ORF attention model to identify LncRNA based on ensemble learning strategy and fused sequence information
topic Software/Web Server Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9941877/
https://www.ncbi.nlm.nih.gov/pubmed/36824229
http://dx.doi.org/10.1016/j.csbj.2023.02.012