Probing an optimal class distribution for enhancing prediction and feature characterization of plant virus-encoded RNA-silencing suppressors
To counter the host RNA silencing defense mechanism, many plant viruses encode RNA silencing suppressor proteins. These proteins share very low sequence and structural similarity, which hampers their annotation using sequence similarity-based search methods. Altern...
Main authors: | Nath, Abhigyan; Subbiah, Karthikeyan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer Berlin Heidelberg, 2016 |
Subjects: | Original Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4801844/ https://www.ncbi.nlm.nih.gov/pubmed/28330163 http://dx.doi.org/10.1007/s13205-016-0410-1 |
_version_ | 1782422625256996864 |
---|---|
author | Nath, Abhigyan Subbiah, Karthikeyan |
author_sort | Nath, Abhigyan |
collection | PubMed |
description | To counter the host RNA silencing defense mechanism, many plant viruses encode RNA silencing suppressor proteins. These proteins share very low sequence and structural similarity, which hampers their annotation using sequence similarity-based search methods. Machine learning-based methods are a suitable alternative, but their performance is affected by factors such as class imbalance, incomplete learning, and the selection of inappropriate features. In this paper, we propose a novel approach to the class imbalance problem: finding the optimal class distribution for enhancing the prediction accuracy for RNA silencing suppressors. The optimal class distribution was obtained by applying different resampling techniques at varying degrees of class distribution, from the natural distribution to the ideal (equal) distribution. The experimental results indicate that the optimal class distribution plays an important role in achieving near-perfect learning. The best prediction results were obtained with the Sequential Minimal Optimization (SMO) learning algorithm: a sensitivity of 98.5 %, a specificity of 92.6 %, and an overall accuracy of 95.3 % on tenfold cross-validation, further validated using a leave-one-out cross-validation test. Machine learning models trained on training sets oversampled with the synthetic minority oversampling technique (SMOTE) also performed relatively better than models trained on either randomly undersampled or imbalanced training sets. Further, we have characterized the important discriminatory sequence features of RNA-silencing suppressors that distinguish these proteins from other protein families. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s13205-016-0410-1) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4801844 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Springer Berlin Heidelberg |
record_format | MEDLINE/PubMed |
spelling | pubmed-4801844 2016-04-11 Probing an optimal class distribution for enhancing prediction and feature characterization of plant virus-encoded RNA-silencing suppressors Nath, Abhigyan Subbiah, Karthikeyan 3 Biotech Original Article Springer Berlin Heidelberg 2016-03-21 2016-06 /pmc/articles/PMC4801844/ /pubmed/28330163 http://dx.doi.org/10.1007/s13205-016-0410-1 Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. |
spellingShingle | Original Article Nath, Abhigyan Subbiah, Karthikeyan Probing an optimal class distribution for enhancing prediction and feature characterization of plant virus-encoded RNA-silencing suppressors |
title | Probing an optimal class distribution for enhancing prediction and feature characterization of plant virus-encoded RNA-silencing suppressors |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4801844/ https://www.ncbi.nlm.nih.gov/pubmed/28330163 http://dx.doi.org/10.1007/s13205-016-0410-1 |
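
The abstract in this record describes oversampling the minority class toward a chosen class distribution and reporting sensitivity, specificity, and accuracy. The sketch below illustrates those two ideas only; it is not the authors' pipeline. The function names, the 2-D toy points, the `ratio` parameter, and the neighbour count `k` are all assumptions made for illustration, and the interpolation step is a simplified take on the SMOTE idea rather than a reference implementation.

```python
import math
import random


def smote_like_oversample(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours (the core SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: math.dist(base, m))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def oversample_to_ratio(minority, n_majority, ratio, k=3, seed=0):
    """Return the minority set plus enough synthetic points that
    len(result) / n_majority is at least `ratio` (hypothetical helper)."""
    target = math.ceil(ratio * n_majority)
    n_new = max(0, target - len(minority))
    return list(minority) + smote_like_oversample(minority, n_new, k, seed)


def sensitivity_specificity_accuracy(tp, fn, tn, fp):
    """Standard confusion-matrix metrics, returned as fractions in [0, 1]."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Toy 2-D feature vectors, made up for illustration (not data from the paper).
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Sweep from a partially balanced to an equal class distribution,
# assuming a majority class of 10 samples.
half_balanced = oversample_to_ratio(toy_minority, n_majority=10, ratio=0.5)
balanced = oversample_to_ratio(toy_minority, n_majority=10, ratio=1.0)
```

Sweeping `ratio` from the natural distribution up to 1.0, retraining a classifier at each step, and comparing the three metrics is one way to look for a class distribution that works best; the classifier itself (SMO in the abstract) is deliberately left out of this sketch.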