Prediction of plant promoters based on hexamers and random triplet pair analysis

BACKGROUND: With an increasing number of plant genome sequences, it has become important to develop a robust computational method for detecting plant promoters. Although a wide variety of programs are currently available, prediction accuracy of these still requires further improvement. The limitatio...

Bibliographic Details
Main authors: Azad, A K M; Shahid, Saima; Noman, Nasimul; Lee, Hyunju
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3160368/
https://www.ncbi.nlm.nih.gov/pubmed/21711543
http://dx.doi.org/10.1186/1748-7188-6-19
collection PubMed
description BACKGROUND: With an increasing number of plant genome sequences available, it has become important to develop a robust computational method for detecting plant promoters. Although a wide variety of programs are currently available, their prediction accuracy still requires improvement. The limitations of these methods can be addressed by selecting appropriate features for distinguishing promoters from non-promoters. METHODS: In this study, we proposed two feature selection approaches based on hexamer sequences: the Frequency Distribution Analyzed Feature Selection Algorithm (FDAFSA) and the Random Triplet Pair Feature Selecting Genetic Algorithm (RTPFSGA). In FDAFSA, adjacent triplet-pairs (hexamer sequences) were selected based on the difference in hexamer frequency between promoters and non-promoters. In RTPFSGA, random triplet-pairs (RTPs) were selected by a genetic algorithm that distinguishes the frequencies of non-adjacent triplet pairs between promoters and non-promoters. Then, a support vector machine (SVM), a nonlinear machine-learning algorithm, was used to classify promoters and non-promoters by combining these two feature selection approaches. We referred to this novel algorithm as PromoBot. RESULTS: Promoter sequences were collected from the PlantProm database. Non-promoter sequences were collected from plant mRNA, rRNA, and tRNA of PlantGDB and plant miRNA of miRBase. To validate the proposed algorithm, we applied a 5-fold cross-validation test. Training data sets were used to select features based on FDAFSA and RTPFSGA, and these features were used to train the SVM. We achieved 89% sensitivity and 86% specificity. CONCLUSIONS: We compared our PromoBot algorithm to five other algorithms. PromoBot's sensitivity and specificity were comparable to, or better than, those of the algorithms tested. These results show that the two proposed feature selection methods, based on hexamer frequencies and random triplet-pairs, can be successfully incorporated into a supervised machine-learning method for the promoter classification problem. As such, we expect that PromoBot can be used to help identify new plant promoters. Source code and analysis results for this work can be provided upon request.
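The two feature types named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the selection criterion, so ranking hexamers by absolute frequency difference, the `top_k` cutoff, and the fixed `gap` between triplets are all assumptions made for illustration, and the function names are hypothetical. The genetic-algorithm search and the SVM training step are omitted.

```python
from collections import Counter
from itertools import product

# All 4096 possible DNA hexamers (adjacent triplet pairs).
HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]


def hexamer_freqs(seqs):
    """Relative frequency of each hexamer across a set of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - 5):
            counts[s[i:i + 6]] += 1
    # Windows containing non-ACGT symbols are ignored in the total.
    total = sum(counts[h] for h in HEXAMERS) or 1
    return {h: counts[h] / total for h in HEXAMERS}


def select_hexamers(promoters, non_promoters, top_k=10):
    """Assumed criterion: keep the top_k hexamers whose frequency
    differs most (in absolute value) between the two classes."""
    fp = hexamer_freqs(promoters)
    fn = hexamer_freqs(non_promoters)
    ranked = sorted(HEXAMERS, key=lambda h: abs(fp[h] - fn[h]), reverse=True)
    return ranked[:top_k]


def triplet_pair_count(seq, first, second, gap):
    """Count occurrences of triplet `first` followed by triplet `second`
    with exactly `gap` bases in between (gap=0 is an ordinary hexamer)."""
    seq = seq.upper()
    span = 6 + gap
    return sum(
        1
        for i in range(len(seq) - span + 1)
        if seq[i:i + 3] == first and seq[i + 3 + gap:i + span] == second
    )
```

In a pipeline of the kind the abstract describes, the selected hexamer frequencies and triplet-pair counts would form the feature vector passed to a classifier, with feature selection repeated on the training portion of each cross-validation fold.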
format Online
Article
Text
id pubmed-3160368
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3160368 2011-08-24 Prediction of plant promoters based on hexamers and random triplet pair analysis Azad, A K M; Shahid, Saima; Noman, Nasimul; Lee, Hyunju Algorithms Mol Biol Research BioMed Central 2011-06-28 /pmc/articles/PMC3160368/ /pubmed/21711543 http://dx.doi.org/10.1186/1748-7188-6-19 Text en Copyright ©2011 Azad et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Prediction of plant promoters based on hexamers and random triplet pair analysis
topic Research