Selecting high-quality negative samples for effectively predicting protein-RNA interactions
BACKGROUND: The identification of Protein-RNA Interactions (PRIs) is important to understanding cell activities. Recently, several machine learning-based methods have been developed for identifying PRIs. However, the performance of these methods is unsatisfactory. One major reason is that they usual...
Main authors: | Cheng, Zhanzhan; Huang, Kai; Wang, Yang; Liu, Hui; Guan, Jihong; Zhou, Shuigeng |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374704/ https://www.ncbi.nlm.nih.gov/pubmed/28361676 http://dx.doi.org/10.1186/s12918-017-0390-8 |
_version_ | 1782518949328453632 |
---|---|
author | Cheng, Zhanzhan; Huang, Kai; Wang, Yang; Liu, Hui; Guan, Jihong; Zhou, Shuigeng |
author_sort | Cheng, Zhanzhan |
collection | PubMed |
description | BACKGROUND: The identification of Protein-RNA Interactions (PRIs) is important to understanding cell activities. Recently, several machine learning-based methods have been developed for identifying PRIs. However, the performance of these methods is unsatisfactory. One major reason is that they usually use unreliable negative samples in the training process. METHODS: To boost the performance of PRI prediction, we propose a novel method to generate reliable negative samples. Concretely, we first collect the known PRIs as positive samples for generating positive sets. For each positive set, we construct two corresponding negative sets, one by our method and the other by a random method. Each positive set is combined with a negative set to form a dataset for model training and performance evaluation. Consequently, we get 18 datasets of different species and different ratios of negative samples to positive samples. Second, sequence-based features are extracted to represent each of the PRIs and protein-RNA pairs in the datasets. A filter-based method is employed to cut down the dimensionality of the feature vectors and so reduce computational cost. Finally, the performance of support vector machine (SVM), random forest (RF) and naive Bayes (NB) classifiers is evaluated on the 18 generated datasets. RESULTS: Extensive experiments show that, compared with using randomly generated negative samples, all classifiers achieve substantial performance improvement by using negative samples selected by our method. The improvements in accuracy and geometric mean are as high as 204.5% and 68.7% for the SVM classifier, 174.5% and 53.9% for the RF classifier, and 80.9% and 54.3% for the NB classifier. CONCLUSION: Our method is useful for the identification of PRIs. |
format | Online Article Text |
id | pubmed-5374704 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5374704 2017-04-03 Selecting high-quality negative samples for effectively predicting protein-RNA interactions Cheng, Zhanzhan Huang, Kai Wang, Yang Liu, Hui Guan, Jihong Zhou, Shuigeng BMC Syst Biol Research
BioMed Central 2017-03-14 /pmc/articles/PMC5374704/ /pubmed/28361676 http://dx.doi.org/10.1186/s12918-017-0390-8 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Selecting high-quality negative samples for effectively predicting protein-RNA interactions |
topic | Research |
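
The abstract in this record compares classifiers trained with randomly generated negative samples against classifiers trained with selected ones, and reports accuracy and geometric mean. The following is a minimal sketch of the random baseline and the two metrics only, under stated assumptions: the record does not describe how random pairs are drawn or how the geometric mean is defined, so the sketch assumes that random negatives are unlabeled protein-RNA pairs built from the proteins and RNAs seen in the positive set, and that geometric mean is the usual square root of sensitivity times specificity. The function names `random_negatives` and `accuracy_and_gmean` are hypothetical, and the authors' own selection method is not reproduced here.

```python
import math
import random


def random_negatives(positives, ratio=1, seed=0):
    """Draw ratio * len(positives) protein-RNA pairs that are not known positives.

    Assumption: candidates are all (protein, RNA) combinations built from the
    proteins and RNAs that occur in the positive set.
    """
    rng = random.Random(seed)
    known = set(positives)
    proteins = sorted({p for p, _ in known})
    rnas = sorted({r for _, r in known})
    candidates = [(p, r) for p in proteins for r in rnas if (p, r) not in known]
    wanted = ratio * len(known)
    if wanted > len(candidates):
        raise ValueError("not enough unlabeled pairs for the requested ratio")
    return rng.sample(candidates, wanted)


def accuracy_and_gmean(y_true, y_pred):
    """Return (accuracy, geometric mean) for binary labels coded as 1 and 0.

    Assumption: geometric mean = sqrt(sensitivity * specificity).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, math.sqrt(sensitivity * specificity)


if __name__ == "__main__":
    # Toy identifiers, not data from the article.
    toy_positives = [("P1", "R1"), ("P2", "R2"), ("P3", "R3")]
    print(random_negatives(toy_positives, ratio=2))
    print(accuracy_and_gmean([1, 1, 0, 0], [1, 0, 0, 0]))
```

The `ratio` parameter mirrors the abstract's mention of datasets with different ratios of negative to positive samples. Reporting the geometric mean alongside accuracy is useful in that setting because accuracy alone can look high on an imbalanced dataset even when one class is mostly misclassified.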