
Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions

Long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial roles in a variety of biological processes and complex diseases. Emerging evidence suggests that lncRNAs interact with corresponding proteins to perform their regulatory functions. Therefore, identifying interacting lncRNA...


Bibliographic Details
Main authors: Yi, Hai-Cheng, You, Zhu-Hong, Cheng, Li, Zhou, Xi, Jiang, Tong-Hai, Li, Xiao, Wang, Yan-Bin
Format: Online Article Text
Language: English
Published: Research Network of Computational and Structural Biotechnology 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6926125/
https://www.ncbi.nlm.nih.gov/pubmed/31890140
http://dx.doi.org/10.1016/j.csbj.2019.11.004
_version_ 1783482034263425024
author Yi, Hai-Cheng
You, Zhu-Hong
Cheng, Li
Zhou, Xi
Jiang, Tong-Hai
Li, Xiao
Wang, Yan-Bin
author_sort Yi, Hai-Cheng
collection PubMed
description Long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial roles in a variety of biological processes and complex diseases. Emerging evidence suggests that lncRNAs interact with corresponding proteins to perform their regulatory functions. Therefore, identifying interacting lncRNA-protein pairs is the first step in understanding the function and mechanism of lncRNAs. Since it is time-consuming and expensive to determine lncRNA-protein interactions by high-throughput experiments, more robust and accurate computational methods need to be developed. In this study, we developed a new method for lncRNA-Protein Interactions Prediction based on distributed representation learning of sequences, named LPI-Pred, which is inspired by the similarity between natural language and biological sequences. More specifically, lncRNA and protein sequences were divided into k-mer segments, which can be regarded as “words” in natural language processing. Then, we trained the RNA2vec and Pro2vec models using word2vec on human genome-wide lncRNA and protein sequences to learn distributed representations of RNA and protein. Next, the dimensionality of the resulting features was reduced by feature selection based on the Gini impurity measure. Finally, these discriminative features were used to train a Random Forest classifier to predict lncRNA-protein interactions. Five-fold cross-validation was adopted to evaluate the performance of LPI-Pred on three benchmark datasets: RPI369, RPI488 and RPI2241. The results demonstrate that LPI-Pred can be a useful tool to provide reliable guidance for biological research.
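The first and third steps of the description above (k-mer segmentation and the Gini impurity measure) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `kmer_tokens` and `gini_impurity`, the values of `k`, the overlapping stride of 1, and the toy sequences are all assumptions chosen for illustration, since the record does not state them.

```python
from collections import Counter


def kmer_tokens(sequence, k=3, stride=1):
    """Split a sequence into k-mer "words".

    k and stride are illustrative assumptions; the record does not
    say which values LPI-Pred uses.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def gini_impurity(labels):
    """Gini impurity 1 - sum_c p_c^2 of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Toy sequences (hypothetical, not taken from RPI369, RPI488 or RPI2241).
rna_words = kmer_tokens("AUGGCUAC", k=3)
protein_words = kmer_tokens("MKVLA", k=2)
print(rna_words)
print(protein_words)
print(gini_impurity([1, 1, 0, 0]))
```

The resulting token lists play the role of "sentences" that a word2vec implementation could consume to learn one vector per k-mer. The impurity function is the quantity that tree-based feature ranking seeks to reduce: a feature that splits the samples into purer groups, with lower impurity, is considered more discriminative.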
format Online
Article
Text
id pubmed-6926125
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Research Network of Computational and Structural Biotechnology
record_format MEDLINE/PubMed
spelling pubmed-6926125 2019-12-30 Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions Yi, Hai-Cheng You, Zhu-Hong Cheng, Li Zhou, Xi Jiang, Tong-Hai Li, Xiao Wang, Yan-Bin Comput Struct Biotechnol J Research Article
Research Network of Computational and Structural Biotechnology 2019-11-30 /pmc/articles/PMC6926125/ /pubmed/31890140 http://dx.doi.org/10.1016/j.csbj.2019.11.004 Text en © 2019 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Research Article
Yi, Hai-Cheng
You, Zhu-Hong
Cheng, Li
Zhou, Xi
Jiang, Tong-Hai
Li, Xiao
Wang, Yan-Bin
Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions
title Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions
title_sort learning distributed representations of rna and protein sequences and its application for predicting lncrna-protein interactions
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6926125/
https://www.ncbi.nlm.nih.gov/pubmed/31890140
http://dx.doi.org/10.1016/j.csbj.2019.11.004