Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm
To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid random forest (RF) with a logistic regression model to efficiently discriminate short ncRNA sequences as well as long complex ncRNA sequences. This RF-based classifier was trained...
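Main authors: | Lertampaiporn, Supatcha; Thammarongtham, Chinae; Nukoolkit, Chakarida; Kaewkamnerdpong, Boonserm; Ruengjitchatchawalya, Marasri |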
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2014 |
Subjects: | Methods Online |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066759/ https://www.ncbi.nlm.nih.gov/pubmed/24771344 http://dx.doi.org/10.1093/nar/gku325 |
_version_ | 1782322209696514048 |
---|---|
author | Lertampaiporn, Supatcha Thammarongtham, Chinae Nukoolkit, Chakarida Kaewkamnerdpong, Boonserm Ruengjitchatchawalya, Marasri |
author_facet | Lertampaiporn, Supatcha Thammarongtham, Chinae Nukoolkit, Chakarida Kaewkamnerdpong, Boonserm Ruengjitchatchawalya, Marasri |
author_sort | Lertampaiporn, Supatcha |
collection | PubMed |
description | To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid random forest (RF) with a logistic regression model to efficiently discriminate short ncRNA sequences as well as long complex ncRNA sequences. This RF-based classifier was trained on a well-balanced dataset with a discriminative set of features and achieved an accuracy, sensitivity and specificity of 92.11%, 90.7% and 93.5%, respectively. The selected feature set includes a new proposed feature, SCORE. This feature is generated based on a logistic regression function that combines five significant features—structure, sequence, modularity, structural robustness and coding potential—to enable improved characterization of long ncRNA (lncRNA) elements. The use of SCORE improved the performance of the RF-based classifier in the identification of Rfam lncRNA families. A genome-wide ncRNA classification framework was applied to a wide variety of organisms, with an emphasis on those of economic, social, public health, environmental and agricultural significance, such as various bacteria genomes, the Arthrospira (Spirulina) genome, and rice and human genomic regions. Our framework was able to identify known ncRNAs with sensitivities of greater than 90% and 77.7% for prokaryotic and eukaryotic sequences, respectively. Our classifier is available at http://ncrna-pred.com/HLRF.htm. |
format | Online Article Text |
id | pubmed-4066759 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-40667592014-06-24 Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm Lertampaiporn, Supatcha Thammarongtham, Chinae Nukoolkit, Chakarida Kaewkamnerdpong, Boonserm Ruengjitchatchawalya, Marasri Nucleic Acids Res Methods Online To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid random forest (RF) with a logistic regression model to efficiently discriminate short ncRNA sequences as well as long complex ncRNA sequences. This RF-based classifier was trained on a well-balanced dataset with a discriminative set of features and achieved an accuracy, sensitivity and specificity of 92.11%, 90.7% and 93.5%, respectively. The selected feature set includes a new proposed feature, SCORE. This feature is generated based on a logistic regression function that combines five significant features—structure, sequence, modularity, structural robustness and coding potential—to enable improved characterization of long ncRNA (lncRNA) elements. The use of SCORE improved the performance of the RF-based classifier in the identification of Rfam lncRNA families. A genome-wide ncRNA classification framework was applied to a wide variety of organisms, with an emphasis on those of economic, social, public health, environmental and agricultural significance, such as various bacteria genomes, the Arthrospira (Spirulina) genome, and rice and human genomic regions. Our framework was able to identify known ncRNAs with sensitivities of greater than 90% and 77.7% for prokaryotic and eukaryotic sequences, respectively. Our classifier is available at http://ncrna-pred.com/HLRF.htm. Oxford University Press 2014-07-01 2014-04-25 /pmc/articles/PMC4066759/ /pubmed/24771344 http://dx.doi.org/10.1093/nar/gku325 Text en © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Methods Online Lertampaiporn, Supatcha Thammarongtham, Chinae Nukoolkit, Chakarida Kaewkamnerdpong, Boonserm Ruengjitchatchawalya, Marasri Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm |
title | Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm |
topic | Methods Online |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066759/ https://www.ncbi.nlm.nih.gov/pubmed/24771344 http://dx.doi.org/10.1093/nar/gku325 |
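
The description in this record says that SCORE is produced by a logistic regression function over five features (structure, sequence, modularity, structural robustness and coding potential) and that the classifier is reported by accuracy, sensitivity and specificity. The sketch below illustrates those two ideas only. The feature keys, the weights and the bias are hypothetical placeholders, since the record does not give the fitted coefficients, and the random forest that consumes SCORE as one of its input features is not shown.

```python
import math

# Hypothetical keys for the five component features named in the abstract.
FEATURES = ("structure", "sequence", "modularity", "robustness", "coding_potential")


def composite_score(features, weights, bias=0.0):
    """Combine the five feature values into one value in (0, 1) with a logistic function.

    `weights` and `bias` are placeholders here; in a real setting they would be
    fitted by logistic regression on labelled training sequences.
    """
    z = bias + sum(weights[name] * features[name] for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))


def confusion_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # fraction of true ncRNAs that are recovered
    specificity = tn / (tn + fp)  # fraction of negatives that are rejected
    return accuracy, sensitivity, specificity


if __name__ == "__main__":
    # Made-up feature values and weights, for illustration only.
    example = dict.fromkeys(FEATURES, 0.5)
    example_weights = dict.fromkeys(FEATURES, 1.0)
    print(composite_score(example, example_weights))
    print(confusion_metrics(tp=90, fn=10, tn=95, fp=5))
```

In a hybrid setup of the kind the abstract describes, the value returned by `composite_score` would be appended to each sequence's feature vector before the random forest is trained.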