Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm

To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid random forest (RF) with a logistic regression model to efficiently discriminate short ncRNA sequences as well as long complex ncRNA sequences. This RF-based classifier was trained...

Bibliographic Details
Main Authors: Lertampaiporn, Supatcha; Thammarongtham, Chinae; Nukoolkit, Chakarida; Kaewkamnerdpong, Boonserm; Ruengjitchatchawalya, Marasri
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Subjects: Methods Online
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066759/
https://www.ncbi.nlm.nih.gov/pubmed/24771344
http://dx.doi.org/10.1093/nar/gku325
author Lertampaiporn, Supatcha
Thammarongtham, Chinae
Nukoolkit, Chakarida
Kaewkamnerdpong, Boonserm
Ruengjitchatchawalya, Marasri
collection PubMed
description To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid random forest (RF) with a logistic regression model to efficiently discriminate short ncRNA sequences as well as long complex ncRNA sequences. This RF-based classifier was trained on a well-balanced dataset with a discriminative set of features and achieved an accuracy, sensitivity and specificity of 92.11%, 90.7% and 93.5%, respectively. The selected feature set includes a new proposed feature, SCORE. This feature is generated based on a logistic regression function that combines five significant features—structure, sequence, modularity, structural robustness and coding potential—to enable improved characterization of long ncRNA (lncRNA) elements. The use of SCORE improved the performance of the RF-based classifier in the identification of Rfam lncRNA families. A genome-wide ncRNA classification framework was applied to a wide variety of organisms, with an emphasis on those of economic, social, public health, environmental and agricultural significance, such as various bacteria genomes, the Arthrospira (Spirulina) genome, and rice and human genomic regions. Our framework was able to identify known ncRNAs with sensitivities of greater than 90% and 77.7% for prokaryotic and eukaryotic sequences, respectively. Our classifier is available at http://ncrna-pred.com/HLRF.htm.
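The description names two computable ideas: a composite feature built from a logistic regression function over five inputs, and the accuracy, sensitivity and specificity used to report performance. The following is a minimal sketch of both, not the authors' implementation. The feature names, weights, intercept and confusion-matrix counts are hypothetical values chosen for illustration; the record does not give the fitted coefficients or the test-set counts.

```python
import math

# Hypothetical names, weights and intercept, for illustration only.
# The record does not state the fitted coefficients behind SCORE.
FEATURES = ("structure", "sequence", "modularity", "robustness", "coding_potential")
WEIGHTS = {
    "structure": 1.2,
    "sequence": 0.8,
    "modularity": 0.5,
    "robustness": 0.9,
    "coding_potential": -1.5,
}
INTERCEPT = -0.3


def composite_score(features):
    """Logistic function of a weighted sum of the five feature values."""
    z = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))


def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true ncRNAs recovered
    specificity = tn / (tn + fp)  # fraction of negatives correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


if __name__ == "__main__":
    # Made-up feature values for a single candidate sequence.
    candidate = {
        "structure": 0.7,
        "sequence": 0.4,
        "modularity": 0.2,
        "robustness": 0.6,
        "coding_potential": 0.1,
    }
    print("composite score:", composite_score(candidate))

    # Made-up counts for an exactly balanced set (1000 positives, 1000 negatives).
    acc, sens, spec = classification_metrics(tp=907, fn=93, tn=935, fp=65)
    print("accuracy, sensitivity, specificity:", acc, sens, spec)
```

On an exactly balanced evaluation set, accuracy equals the mean of sensitivity and specificity. For the reported figures, (90.7% + 93.5%) / 2 = 92.1%, which agrees with the stated accuracy of 92.11% up to rounding. In a hybrid design of the kind described, a value such as `composite_score` would be passed to the random forest as one additional input feature.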
format Online
Article
Text
id pubmed-4066759
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4066759 2014-06-24 Nucleic Acids Res Methods Online Oxford University Press 2014-07-01 2014-04-25 /pmc/articles/PMC4066759/ /pubmed/24771344 http://dx.doi.org/10.1093/nar/gku325 Text en © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm
topic Methods Online
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066759/
https://www.ncbi.nlm.nih.gov/pubmed/24771344
http://dx.doi.org/10.1093/nar/gku325