A novel riboswitch classification based on imbalanced sequences achieved by machine learning

A riboswitch, a regulatory segment of mRNA (50–250 nt in length), has two main domains: the aptamer and the expression platform. One of the main challenges in riboswitch classification is imbalanced data. That is a circumstance in which the sequence records of one group are very small comp...

Bibliographic Details
Main authors: Beyene, Solomon Shiferaw, Ling, Tianyi, Ristevski, Blagoj, Chen, Ming
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7392346/
https://www.ncbi.nlm.nih.gov/pubmed/32687488
http://dx.doi.org/10.1371/journal.pcbi.1007760
author Beyene, Solomon Shiferaw
Ling, Tianyi
Ristevski, Blagoj
Chen, Ming
collection PubMed
description A riboswitch, a regulatory segment of mRNA (50–250 nt in length), has two main domains: the aptamer and the expression platform. One of the main challenges in riboswitch classification is imbalanced data, that is, a situation in which one group has very few sequence records compared to the others. Such imbalance leads a classifier to ignore the minority group and emphasize the majority ones, which results in skewed classification. To be in accord with recent riboswitch classification work, we considered sixteen riboswitch families that contain imbalanced sequences. The sequences were split into training and test sets using a newly developed pipeline. From the 5460 k-mers produced (k = 1 to 6), 156 features were selected using the CfsSubsetEval and BestFirst functions in WEKA 3.8. Statistical testing showed a significant difference between balanced and imbalanced sequences (p < 0.05). In addition, each algorithm showed a significant difference in sensitivity, specificity, accuracy, and macro F-score between the two groups (p < 0.05). Several k-mers clustered in the heat map were found at different structural positions such as interior loops, terminal loops and helices; they were validated to have biological functions, and some are riboswitch motifs. The analysis highlights the importance of addressing majority bias and overfitting. The presented results are a generalized evaluation of both the balanced and imbalanced models, which implies their ability to classify novel riboswitches. The Python source code is available at https://github.com/Seasonsling/riboswitch.
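The figure of 5460 k-mers in the description is consistent with enumerating every word of length 1 to 6 over a four-letter RNA alphabet (4 + 16 + 64 + 256 + 1024 + 4096 = 5460). The sketch below illustrates that count and one plausible way to turn a sequence into a k-mer frequency vector. It is not taken from the authors' repository: the function names, the A/C/G/U alphabet, the T-to-U conversion, and the per-k normalization are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; the record does not state it


def all_kmers(k_min=1, k_max=6):
    """Enumerate every k-mer over ALPHABET for k_min <= k <= k_max."""
    return [
        "".join(letters)
        for k in range(k_min, k_max + 1)
        for letters in product(ALPHABET, repeat=k)
    ]


def kmer_feature_vector(sequence, k_min=1, k_max=6):
    """Return k-mer frequencies in the order given by all_kmers().

    Each count is divided by the number of windows of that length,
    which is an assumed normalization, not the paper's stated one.
    """
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input
    frequencies = {}
    for k in range(k_min, k_max + 1):
        windows = len(seq) - k + 1
        if windows <= 0:
            continue  # sequence shorter than k: leave these features at 0
        counts = Counter(seq[i:i + k] for i in range(windows))
        for kmer, count in counts.items():
            frequencies[kmer] = count / windows
    # Windows containing symbols outside ALPHABET (e.g. N) are ignored here.
    return [frequencies.get(kmer, 0.0) for kmer in all_kmers(k_min, k_max)]


if __name__ == "__main__":
    print(len(all_kmers()))                      # size of the feature space
    print(kmer_feature_vector("ACGUACGU")[:4])   # single-nucleotide frequencies
```

A feature-selection step such as the CfsSubsetEval and BestFirst combination named in the description would then operate on vectors of this kind; that step is not reproduced here.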
format Online
Article
Text
id pubmed-7392346
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7392346 2020-08-12 A novel riboswitch classification based on imbalanced sequences achieved by machine learning Beyene, Solomon Shiferaw Ling, Tianyi Ristevski, Blagoj Chen, Ming PLoS Comput Biol Research Article
Public Library of Science 2020-07-20 /pmc/articles/PMC7392346/ /pubmed/32687488 http://dx.doi.org/10.1371/journal.pcbi.1007760 Text en © 2020 Beyene et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title A novel riboswitch classification based on imbalanced sequences achieved by machine learning
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7392346/
https://www.ncbi.nlm.nih.gov/pubmed/32687488
http://dx.doi.org/10.1371/journal.pcbi.1007760