Selecting high-quality negative samples for effectively predicting protein-RNA interactions

BACKGROUND: The identification of Protein-RNA Interactions (PRIs) is important to understanding cell activities. Recently, several machine learning-based methods have been developed for identifying PRIs. However, the performance of these methods is unsatisfactory. One major reason is that they usual...

Bibliographic Details
Main Authors: Cheng, Zhanzhan, Huang, Kai, Wang, Yang, Liu, Hui, Guan, Jihong, Zhou, Shuigeng
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374704/
https://www.ncbi.nlm.nih.gov/pubmed/28361676
http://dx.doi.org/10.1186/s12918-017-0390-8
author Cheng, Zhanzhan
Huang, Kai
Wang, Yang
Liu, Hui
Guan, Jihong
Zhou, Shuigeng
collection PubMed
description BACKGROUND: Identifying protein-RNA interactions (PRIs) is important for understanding cell activities. Recently, several machine learning-based methods have been developed for identifying PRIs. However, the performance of these methods is unsatisfactory. One major reason is that they usually use unreliable negative samples in the training process. METHODS: To boost the performance of PRI prediction, we propose a novel method to generate reliable negative samples. Concretely, we first collect the known PRIs as positive samples for generating positive sets. For each positive set, we construct two corresponding negative sets, one by our method and the other by a random method. Each positive set is combined with a negative set to form a dataset for model training and performance evaluation. Consequently, we get 18 datasets covering different species and different ratios of negative samples to positive samples. Second, sequence-based features are extracted to represent each PRI and each protein-RNA pair in the datasets. A filter-based method is employed to reduce the dimensionality of the feature vectors and thereby the computational cost. Finally, the performance of support vector machine (SVM), random forest (RF) and naive Bayes (NB) classifiers is evaluated on the 18 generated datasets. RESULTS: Extensive experiments show that, compared with using randomly generated negative samples, all classifiers achieve substantial performance improvement by using negative samples selected by our method. The improvements in accuracy and geometric mean are as high as 204.5% and 68.7% for the SVM classifier, 174.5% and 53.9% for the RF classifier, and 80.9% and 54.3% for the NB classifier. CONCLUSION: Our method is useful for the identification of PRIs.
format Online
Article
Text
id pubmed-5374704
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5374704 2017-04-03 Selecting high-quality negative samples for effectively predicting protein-RNA interactions Cheng, Zhanzhan; Huang, Kai; Wang, Yang; Liu, Hui; Guan, Jihong; Zhou, Shuigeng BMC Syst Biol Research
BioMed Central 2017-03-14 /pmc/articles/PMC5374704/ /pubmed/28361676 http://dx.doi.org/10.1186/s12918-017-0390-8 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Selecting high-quality negative samples for effectively predicting protein-RNA interactions
topic Research
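
The abstract in this record compares classifiers trained with randomly generated negative samples against classifiers trained with selected negatives, and reports accuracy and geometric mean. The record does not give the selection procedure or the metric formula, so the following is only a minimal Python sketch of the random baseline and of the geometric mean, assuming the common definition (square root of sensitivity times specificity). The function names `random_negatives` and `geometric_mean` and the toy identifiers are hypothetical and do not come from the article.

```python
import math
import random


def random_negatives(positives, ratio, seed=0):
    """Random baseline (assumed form): sample protein-RNA pairs that are
    not known interactions, at `ratio` negatives per positive."""
    positive_set = set(positives)
    proteins = sorted({p for p, _ in positive_set})
    rnas = sorted({r for _, r in positive_set})
    # Every protein-RNA combination that is not a known positive pair.
    candidates = [(p, r) for p in proteins for r in rnas
                  if (p, r) not in positive_set]
    wanted = int(ratio * len(positive_set))
    if wanted > len(candidates):
        raise ValueError("not enough unlabeled pairs for the requested ratio")
    return random.Random(seed).sample(candidates, wanted)


def geometric_mean(y_true, y_pred):
    """G-mean under the assumed definition sqrt(sensitivity * specificity),
    with 1 marking an interacting pair and 0 a non-interacting pair."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy usage with made-up identifiers.
known = [("P1", "R1"), ("P2", "R2"), ("P3", "R3")]
negatives = random_negatives(known, ratio=1)
score = geometric_mean([1, 1, 0, 0], [1, 0, 0, 0])
```

Random sampling of this kind may label undiscovered interactions as negatives, which is the unreliability the abstract attributes to existing methods. The geometric mean is reported alongside accuracy because it stays low when a classifier does well on only one class, which matters when the ratio of negative to positive samples varies across datasets.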