Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning

Bibliographic Details
Main authors: Jha, Tony; Mendel, Jovinna; Cho, Hyuk; Choudhary, Madhusudan
Format: Online Article Text
Language: English
Published: SAGE Publications 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9397377/
https://www.ncbi.nlm.nih.gov/pubmed/36016866
http://dx.doi.org/10.1177/11779322221118335
_version_ 1784772113037524992
author Jha, Tony
Mendel, Jovinna
Cho, Hyuk
Choudhary, Madhusudan
author_facet Jha, Tony
Mendel, Jovinna
Cho, Hyuk
Choudhary, Madhusudan
author_sort Jha, Tony
collection PubMed
description Small ribonucleic acid (sRNA) sequences are noncoding RNA (ncRNA) sequences, 50–500 nucleotides long, that play an important role in regulating transcription and translation within a bacterial cell. As such, identifying sRNA sequences within an organism’s genome is essential to understanding the impact of these RNA molecules on cellular processes. Recently, numerous machine learning models have been applied to predict sRNAs within bacterial genomes. In this study, we treated sRNA prediction as an imbalanced binary classification problem, in which minority positive sRNAs must be distinguished from majority negative sequences, and performed a comparative study with six learning algorithms and seven assessment metrics. First, we collected numerical feature groups extracted from known sRNAs previously identified in the Salmonella typhimurium LT2 (SLT2) and Escherichia coli K12 (E. coli K12) genomes. Second, as a preliminary study, we characterized the sRNA-size distribution with the conformity test for Benford’s law. Third, we applied six traditional classification algorithms to the sRNA features and assessed classification performance with seven metrics, varying the positive-to-negative instance ratio and using stratified 10-fold cross-validation. We revisited important individual features and feature groups and found that classification with combined features performs better than classification with either an individual feature or a single feature group in terms of the area under the precision-recall curve (AUPR). We reconfirmed that AUPR properly measures classification performance on imbalanced data with varying imbalance ratios, which is consistent with previous studies on classification metrics for imbalanced data. Overall, eXtreme Gradient Boosting (XGBoost), even without optimized hyperparameter values, performed better than the other five algorithms with their specific optimal parameter settings. As future work, we plan to apply XGBoost to a large number of published sRNAs in bacterial genomes and compare its classification performance with that of recent machine learning models.
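The abstract's point that AUPR tracks the imbalance ratio can be illustrated with a small, self-contained sketch. This is not the authors' code: the function names (`average_precision`, `roc_auc`, `toy_dataset`) and the toy scores are hypothetical, and average precision is used here as a step-wise estimate of AUPR. The sketch assumes no positive score ties with a negative score.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Averages the precision measured at the rank of each positive instance.
    Assumes no positive instance shares a score with a negative one.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    if true_positives == 0:
        raise ValueError("at least one positive instance is required")
    return precision_sum / true_positives


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def toy_dataset(negatives_per_positive):
    """Hypothetical scores: 2 positives, plus negatives repeated to set the ratio."""
    positive_scores = [0.9, 0.6]
    negative_scores = [0.8, 0.3] * negatives_per_positive
    labels = [1] * len(positive_scores) + [0] * len(negative_scores)
    return labels, positive_scores + negative_scores


# Same scoring behaviour, evaluated at 1:1, 1:5 and 1:10 positive-to-negative ratios.
for ratio in (1, 5, 10):
    labels, scores = toy_dataset(ratio)
    print(
        f"1:{ratio}  AUPR~{average_precision(labels, scores):.4f}"
        f"  ROC-AUC={roc_auc(labels, scores):.4f}"
    )
```

In this toy setup the ROC AUC stays at 0.75 for every ratio, while the average precision falls from about 0.83 (1:1) to about 0.64 (1:5) and about 0.58 (1:10), because each additional negative ranked above a positive lowers precision. That sensitivity to class balance is the property the abstract relies on when it compares classifiers across positive-to-negative ratios.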
format Online
Article
Text
id pubmed-9397377
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher SAGE Publications
record_format MEDLINE/PubMed
spelling pubmed-9397377 2022-08-24 Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning Jha, Tony Mendel, Jovinna Cho, Hyuk Choudhary, Madhusudan Bioinform Biol Insights Original Research Article Small ribonucleic acid (sRNA) sequences are noncoding RNA (ncRNA) sequences, 50–500 nucleotides long, that play an important role in regulating transcription and translation within a bacterial cell. As such, identifying sRNA sequences within an organism’s genome is essential to understanding the impact of these RNA molecules on cellular processes. Recently, numerous machine learning models have been applied to predict sRNAs within bacterial genomes. In this study, we treated sRNA prediction as an imbalanced binary classification problem, in which minority positive sRNAs must be distinguished from majority negative sequences, and performed a comparative study with six learning algorithms and seven assessment metrics. First, we collected numerical feature groups extracted from known sRNAs previously identified in the Salmonella typhimurium LT2 (SLT2) and Escherichia coli K12 (E. coli K12) genomes. Second, as a preliminary study, we characterized the sRNA-size distribution with the conformity test for Benford’s law. Third, we applied six traditional classification algorithms to the sRNA features and assessed classification performance with seven metrics, varying the positive-to-negative instance ratio and using stratified 10-fold cross-validation. We revisited important individual features and feature groups and found that classification with combined features performs better than classification with either an individual feature or a single feature group in terms of the area under the precision-recall curve (AUPR). We reconfirmed that AUPR properly measures classification performance on imbalanced data with varying imbalance ratios, which is consistent with previous studies on classification metrics for imbalanced data. Overall, eXtreme Gradient Boosting (XGBoost), even without optimized hyperparameter values, performed better than the other five algorithms with their specific optimal parameter settings. As future work, we plan to apply XGBoost to a large number of published sRNAs in bacterial genomes and compare its classification performance with that of recent machine learning models. SAGE Publications 2022-08-18 /pmc/articles/PMC9397377/ /pubmed/36016866 http://dx.doi.org/10.1177/11779322221118335 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by-nc/4.0/ This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).
spellingShingle Original Research Article
Jha, Tony
Mendel, Jovinna
Cho, Hyuk
Choudhary, Madhusudan
Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning
title Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning
title_full Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning
title_fullStr Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning
title_full_unstemmed Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning
title_short Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning
title_sort prediction of bacterial srnas using sequence-derived features and machine learning
topic Original Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9397377/
https://www.ncbi.nlm.nih.gov/pubmed/36016866
http://dx.doi.org/10.1177/11779322221118335
work_keys_str_mv AT jhatony predictionofbacterialsrnasusingsequencederivedfeaturesandmachinelearning
AT mendeljovinna predictionofbacterialsrnasusingsequencederivedfeaturesandmachinelearning
AT chohyuk predictionofbacterialsrnasusingsequencederivedfeaturesandmachinelearning
AT choudharymadhusudan predictionofbacterialsrnasusingsequencederivedfeaturesandmachinelearning