Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions
Long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial roles in a variety of biological processes and complex diseases. Emerging evidence suggests that lncRNAs interact with corresponding proteins to perform their regulatory functions. Therefore, identifying interacting lncRNA...
Main Authors: | Yi, Hai-Cheng; You, Zhu-Hong; Cheng, Li; Zhou, Xi; Jiang, Tong-Hai; Li, Xiao; Wang, Yan-Bin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Research Network of Computational and Structural Biotechnology 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6926125/ https://www.ncbi.nlm.nih.gov/pubmed/31890140 http://dx.doi.org/10.1016/j.csbj.2019.11.004 |
_version_ | 1783482034263425024 |
---|---|
author | Yi, Hai-Cheng You, Zhu-Hong Cheng, Li Zhou, Xi Jiang, Tong-Hai Li, Xiao Wang, Yan-Bin |
author_sort | Yi, Hai-Cheng |
collection | PubMed |
description | Long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial roles in a variety of biological processes and complex diseases. Emerging evidence suggests that lncRNAs interact with corresponding proteins to perform their regulatory functions. Therefore, identifying interacting lncRNA-protein pairs is the first step in understanding the function and mechanism of lncRNAs. Since it is time-consuming and expensive to determine lncRNA-protein interactions by high-throughput experiments, more robust and accurate computational methods need to be developed. In this study, we developed a new method, based on distributed representation learning of sequences, for predicting potential lncRNA-protein interactions. The method, named LPI-Pred, is inspired by the similarity between natural language and biological sequences. More specifically, lncRNA and protein sequences were segmented into k-mers, which can be regarded as “words” in natural language processing. Then, we trained the RNA2vec and Pro2vec models using word2vec on human genome-wide lncRNA and protein sequences to learn distributed representations of RNA and protein. Next, the dimension of the complex features is reduced by feature selection based on the Gini impurity measure. Finally, these discriminative features are used to train a Random Forest classifier to predict lncRNA-protein interactions. Five-fold cross-validation was adopted to evaluate the performance of LPI-Pred on three benchmark datasets: RPI369, RPI488 and RPI2241. The results demonstrate that LPI-Pred can be a useful tool to provide reliable guidance for biological research. |
format | Online Article Text |
id | pubmed-6926125 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Research Network of Computational and Structural Biotechnology |
record_format | MEDLINE/PubMed |
spelling | pubmed-6926125 2019-12-30 Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions Yi, Hai-Cheng You, Zhu-Hong Cheng, Li Zhou, Xi Jiang, Tong-Hai Li, Xiao Wang, Yan-Bin Comput Struct Biotechnol J Research Article Research Network of Computational and Structural Biotechnology 2019-11-30 /pmc/articles/PMC6926125/ /pubmed/31890140 http://dx.doi.org/10.1016/j.csbj.2019.11.004 Text en © 2019 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
title | Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions |
topic | Research Article |
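
The description above treats k-mers of a sequence as "words" that are then fed to word2vec. A minimal sketch of that segmentation step follows. It is illustrative only: the function name `kmer_words`, the default `k=3`, and the choice of overlapping windows (`stride=1`) are assumptions made here, since the record does not state the parameters the authors used.

```python
def kmer_words(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a biological sequence into k-mer "words".

    Hypothetical helper: k and stride are illustrative defaults,
    not values taken from the catalogued article.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # Slide a window of width k along the sequence; sequences shorter
    # than k yield an empty list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# A toy RNA fragment and a toy protein fragment, each as a "sentence" of words.
rna_sentence = kmer_words("AUGGCUA", k=3)
protein_sentence = kmer_words("MKTAY", k=3)
print(rna_sentence)      # ['AUG', 'UGG', 'GGC', 'GCU', 'CUA']
print(protein_sentence)  # ['MKT', 'KTA', 'TAY']
```

Lists like these could then serve as the tokenized sentences on which a word-embedding model is trained. Setting `stride=k` would give non-overlapping segments instead.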