Prediction of donor splice sites using random forest with a new sequence encoding approach

BACKGROUND: Detection of splice sites plays a key role for predicting the gene structure and thus development of efficient analytical methods for splice site prediction is vital. This paper presents a novel sequence encoding approach based on the adjacent di-nucleotide dependencies in which the dono...

Bibliographic Details
Main authors: Meher, Prabina Kumar; Sahu, Tanmaya Kumar; Rao, Atmakuri Ramakrishna
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects: Methodology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4724119/
https://www.ncbi.nlm.nih.gov/pubmed/26807151
http://dx.doi.org/10.1186/s13040-016-0086-4
author Meher, Prabina Kumar
Sahu, Tanmaya Kumar
Rao, Atmakuri Ramakrishna
collection PubMed
description BACKGROUND: Detection of splice sites plays a key role for predicting the gene structure and thus development of efficient analytical methods for splice site prediction is vital. This paper presents a novel sequence encoding approach based on the adjacent di-nucleotide dependencies in which the donor splice site motifs are encoded into numeric vectors. The encoded vectors are then used as input in Random Forest (RF), Support Vector Machines (SVM) and Artificial Neural Network (ANN), Bagging, Boosting, Logistic regression, kNN and Naïve Bayes classifiers for prediction of donor splice sites. RESULTS: The performance of the proposed approach is evaluated on the donor splice site sequence data of Homo sapiens, collected from Homo Sapiens Splice Sites Dataset (HS3D). The results showed that RF outperformed all the considered classifiers. Besides, RF achieved higher prediction accuracy than the existing methods viz., MEM, MDD, WMM, MM1, NNSplice and SpliceView, while compared using an independent test dataset. CONCLUSION: Based on the proposed approach, we have developed an online prediction server (MaLDoSS) to help the biological community in predicting the donor splice sites. The server is made freely available at http://cabgrid.res.in:8080/maldoss. Due to computational feasibility and high prediction accuracy, the proposed approach is believed to help in predicting the eukaryotic gene structure. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13040-016-0086-4) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4724119
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4724119 2016-01-24 Prediction of donor splice sites using random forest with a new sequence encoding approach. Meher, Prabina Kumar; Sahu, Tanmaya Kumar; Rao, Atmakuri Ramakrishna. BioData Min. Methodology. BioMed Central 2016-01-22 /pmc/articles/PMC4724119/ /pubmed/26807151 http://dx.doi.org/10.1186/s13040-016-0086-4 Text en
© Meher et al. 2016. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Prediction of donor splice sites using random forest with a new sequence encoding approach
topic Methodology