
Identification of Cancer-Related Long Non-Coding RNAs Using XGBoost With High Accuracy

In the past decade, hundreds of long noncoding RNAs (lncRNAs) have been identified as significant players in diverse types of cancer; however, the functions and mechanisms of most lncRNAs in cancer remain unclear. Several computational methods have been developed to detect associations between cance...


Bibliographic Details
Main authors: Zhang, Xuan; Li, Tianjun; Wang, Jun; Li, Jing; Chen, Long; Liu, Changning
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2019
Subjects: Genetics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6701491/
https://www.ncbi.nlm.nih.gov/pubmed/31456817
http://dx.doi.org/10.3389/fgene.2019.00735
author Zhang, Xuan
Li, Tianjun
Wang, Jun
Li, Jing
Chen, Long
Liu, Changning
collection PubMed
description In the past decade, hundreds of long noncoding RNAs (lncRNAs) have been identified as significant players in diverse types of cancer; however, the functions and mechanisms of most lncRNAs in cancer remain unclear. Several computational methods have been developed to detect associations between cancer and lncRNAs, yet those approaches have limitations in both sensitivity and specificity. With the goal of improving the prediction accuracy for associations of lncRNA with cancer, we upgraded our previously developed cancer-related lncRNA classifier, CRlncRC, to generate CRlncRC2. CRlncRC2 is an eXtreme Gradient Boosting (XGBoost) machine learning framework, including Synthetic Minority Over-sampling Technique (SMOTE)-based over-sampling, along with Laplacian Score-based feature selection. Ten-fold cross-validation showed that the AUC value of CRlncRC2 for identification of cancer-related lncRNAs is much higher than previously reported by CRlncRC and others. Compared with CRlncRC, the number of features used by CRlncRC2 dropped from 85 to 51. Finally, we identified 439 cancer-related lncRNA candidates using CRlncRC2. To evaluate the accuracy of the predictions, we first consulted the cancer-related long non-coding RNA database Lnc2Cancer v2.0 and relevant literature for supporting information, then conducted statistical analysis of somatic mutations, distance from cancer genes, and differential expression in tumor tissues, using various data sets. The results showed that our approach was highly reliable for identifying cancer-related lncRNA candidates. Notably, the highest ranked candidate, lncRNA AC074117.1, has not been reported previously; however, integrated multi-omics analyses demonstrate that it is the target of multiple cancer-related miRNAs and interacts with adjacent protein-coding genes, suggesting that it may act as a cancer-related competing endogenous RNA, which warrants further investigation. 
In conclusion, CRlncRC2 is an effective and accurate method for identification of cancer-related lncRNAs, and has potential to contribute to the functional annotation of lncRNAs and guide cancer therapy.
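The description above names two preprocessing steps, Laplacian Score-based feature selection and SMOTE-based over-sampling, without giving details. The following is a minimal NumPy sketch of those two techniques on toy data, not the authors' CRlncRC2 code. The function names, the parameters `k` and `t`, the toy data, and the order of the two steps are all assumptions made for illustration; the XGBoost classifier and the ten-fold cross-validation are omitted.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between rows; self-distances set to inf."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq, np.inf)
    return sq

def laplacian_scores(X, k=5, t=1.0):
    """Laplacian Score per feature (lower = better preserves local structure).

    Hypothetical helper: k (neighbours) and t (heat-kernel width) are
    illustrative defaults, not values taken from the paper.
    """
    n = X.shape[0]
    sq = pairwise_sq_dists(X)
    nbrs = np.argsort(sq, axis=1)[:, :k]           # k nearest neighbours per sample
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    S = np.zeros((n, n))
    S[rows, cols] = np.exp(-sq[rows, cols] / t)    # heat-kernel edge weights
    S = np.maximum(S, S.T)                         # make the graph symmetric
    d = S.sum(axis=1)                              # node degrees
    L = np.diag(d) - S                             # graph Laplacian
    F = X - (d @ X) / d.sum()                      # degree-weighted centring
    num = (F * (L @ F)).sum(axis=0)                # f^T L f for each feature
    den = (F * d[:, None] * F).sum(axis=0)         # f^T D f for each feature
    return num / np.maximum(den, 1e-12)

def smote(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority samples by interpolating between
    a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    k = min(k, len(X_min) - 1)
    nbrs = np.argsort(pairwise_sq_dists(X_min), axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    pick = nbrs[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                   # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Toy data (made up): 50 majority and 10 minority samples.
# Column 0 separates the classes; columns 1-3 are pure noise.
rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 10)
signal = labels * 10.0 + rng.normal(0.0, 0.1, 60)
X = np.column_stack([signal, rng.normal(0.0, 1.0, (60, 3))])

scores = laplacian_scores(X)
keep = np.argsort(scores)[:2]                      # keep the 2 lowest-scoring features
X_min = X[labels == 1][:, keep]
synthetic = smote(X_min, n_new=40, k=5, rng=0)     # balance 10 minority up to 50
```

In a fuller pipeline, the reduced and rebalanced training matrix would then be passed to a gradient-boosted classifier and scored by AUC within each cross-validation fold, with over-sampling applied to the training folds only so that synthetic samples do not leak into the held-out data.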
format Online
Article
Text
id pubmed-6701491
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-6701491 2019-08-27 Front Genet Genetics Frontiers Media S.A. 2019-08-09 /pmc/articles/PMC6701491/ /pubmed/31456817 http://dx.doi.org/10.3389/fgene.2019.00735 Text en Copyright © 2019 Zhang, Li, Wang, Li, Chen and Liu http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Identification of Cancer-Related Long Non-Coding RNAs Using XGBoost With High Accuracy
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6701491/
https://www.ncbi.nlm.nih.gov/pubmed/31456817
http://dx.doi.org/10.3389/fgene.2019.00735