Identification of donor splice sites using support vector machine: a computational approach based on positional, compositional and dependency features

BACKGROUND: Identification of splice sites is essential for annotation of genes. Though existing approaches have achieved an acceptable level of accuracy, still there is a need for further improvement. Besides, most of the approaches are species-specific and hence it is required to develop approache...

Bibliographic Details
Main Authors: Meher, Prabina Kumar, Sahu, Tanmaya Kumar, Rao, A. R., Wahi, S. D.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4888255/
https://www.ncbi.nlm.nih.gov/pubmed/27252772
http://dx.doi.org/10.1186/s13015-016-0078-4
author Meher, Prabina Kumar
Sahu, Tanmaya Kumar
Rao, A. R.
Wahi, S. D.
author_sort Meher, Prabina Kumar
collection PubMed
description BACKGROUND: Identification of splice sites is essential for gene annotation. Although existing approaches achieve an acceptable level of accuracy, there is still room for improvement. Moreover, most approaches are species-specific, so approaches that work across species are needed. RESULTS: Each splice site sequence was transformed into a numeric vector of length 49, comprising four positional, four dependency and 41 compositional features. The transformed vectors were used as input to a support vector machine for prediction. With balanced training sets, the proposed approach achieved an area under the ROC curve (AUC-ROC) of 96.05, 96.96, 96.95 and 96.24 % and an area under the PR curve (AUC-PR) of 97.64, 97.89, 97.91 and 97.90 % when tested on human, cattle, fish and worm datasets, respectively. With imbalanced training datasets, AUC-ROC values of 97.21, 97.45, 97.41 and 98.06 % and AUC-PR values of 93.24, 93.34, 93.38 and 92.29 % were obtained. The proposed approach was found to be comparable with state-of-the-art splice site prediction approaches when compared using the benchmark NN269 dataset and other datasets. CONCLUSIONS: The proposed approach achieved consistent accuracy across different species and was found to be comparable with existing approaches. Thus, we believe that it can be used as a complementary method to existing methods for the prediction of splice sites. A web server named ‘HSplice’ has also been developed based on the proposed approach for easy prediction of 5′ splice sites and is freely available at http://cabgrid.res.in:8080/HSplice.
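The two evaluation measures reported in the description, AUC-ROC and AUC-PR, can be illustrated with a minimal Python sketch. The record does not say how the authors computed these areas, so the estimators below are assumptions: AUC-ROC is taken from the Mann-Whitney pairwise-ranking statistic and AUC-PR is approximated by average precision. The function names `auc_roc` and `auc_pr` and the toy labels and scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the Mann-Whitney statistic.

    This is the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_pr(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, a step-wise estimate of the area under the PR curve.

    Examples are ranked by decreasing score; tied scores keep their input
    order, which is a simplification (an assumption of this sketch).
    """
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


# Toy data: 1 = true donor site, 0 = false site (hypothetical values).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc = auc_roc(toy_labels, toy_scores)
pr = auc_pr(toy_labels, toy_scores)
print(f"AUC-ROC = {100 * roc:.2f} %, AUC-PR = {100 * pr:.2f} %")
```

Both functions return fractions between 0 and 1; multiplying by 100 gives percentages in the form quoted in the description. Because AUC-PR depends on the proportion of positives in the test set while AUC-ROC does not, the two measures can move in different directions between balanced and imbalanced settings.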
format Online
Article
Text
id pubmed-4888255
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4888255 2016-06-02 Identification of donor splice sites using support vector machine: a computational approach based on positional, compositional and dependency features Meher, Prabina Kumar Sahu, Tanmaya Kumar Rao, A. R. Wahi, S. D. Algorithms Mol Biol Research
BioMed Central 2016-06-01 /pmc/articles/PMC4888255/ /pubmed/27252772 http://dx.doi.org/10.1186/s13015-016-0078-4 Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Identification of donor splice sites using support vector machine: a computational approach based on positional, compositional and dependency features
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4888255/
https://www.ncbi.nlm.nih.gov/pubmed/27252772
http://dx.doi.org/10.1186/s13015-016-0078-4