
Selecting high-quality negative samples for effectively predicting protein-RNA interactions

BACKGROUND: The identification of Protein-RNA Interactions (PRIs) is important to understanding cell activities. Recently, several machine learning-based methods have been developed for identifying PRIs. However, the performance of these methods is unsatisfactory. One major reason is that they usual...


Bibliographic Details
Main Authors:	Cheng, Zhanzhan; Huang, Kai; Wang, Yang; Liu, Hui; Guan, Jihong; Zhou, Shuigeng
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374704/ https://www.ncbi.nlm.nih.gov/pubmed/28361676 http://dx.doi.org/10.1186/s12918-017-0390-8

_version_	1782518949328453632
author	Cheng, Zhanzhan Huang, Kai Wang, Yang Liu, Hui Guan, Jihong Zhou, Shuigeng
author_sort	Cheng, Zhanzhan
collection	PubMed
description	BACKGROUND: Identifying protein-RNA interactions (PRIs) is important for understanding cell activities. Several machine learning-based methods have recently been developed for identifying PRIs, but their performance is unsatisfactory. One major reason is that they usually train on unreliable negative samples. METHODS: To boost the performance of PRI prediction, we propose a novel method for generating reliable negative samples. First, we collect the known PRIs as positive samples and use them to generate positive sets. For each positive set, we construct two corresponding negative sets, one by our method and the other by random selection. Each positive set is combined with a negative set to form a dataset for model training and performance evaluation. This yields 18 datasets covering different species and different ratios of negative to positive samples. Second, sequence-based features are extracted to represent each PRI and each protein-RNA pair in the datasets, and a filter-based method is used to reduce the dimensionality of the feature vectors and thus the computational cost. Finally, the performance of support vector machine (SVM), random forest (RF) and naive Bayes (NB) classifiers is evaluated on the 18 generated datasets. RESULTS: Extensive experiments show that, compared with using randomly generated negative samples, all classifiers achieve substantial performance improvements when using negative samples selected by our method. The improvements in accuracy and geometric mean are as high as 204.5% and 68.7% for the SVM classifier, 174.5% and 53.9% for the RF classifier, and 80.9% and 54.3% for the NB classifier. CONCLUSION: Our method is useful for the identification of PRIs.
format	Online Article Text
id	pubmed-5374704
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5374704 2017-04-03 Selecting high-quality negative samples for effectively predicting protein-RNA interactions Cheng, Zhanzhan Huang, Kai Wang, Yang Liu, Hui Guan, Jihong Zhou, Shuigeng BMC Syst Biol Research
BioMed Central 2017-03-14 /pmc/articles/PMC5374704/ /pubmed/28361676 http://dx.doi.org/10.1186/s12918-017-0390-8 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Selecting high-quality negative samples for effectively predicting protein-RNA interactions
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374704/ https://www.ncbi.nlm.nih.gov/pubmed/28361676 http://dx.doi.org/10.1186/s12918-017-0390-8
