Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning

Small ribonucleic acid (sRNA) sequences are 50–500 nucleotide long, noncoding RNA (ncRNA) sequences that play an important role in regulating transcription and translation within a bacterial cell. As such, identifying sRNA sequences within an organism’s genome is essential to understand the impact o...

Bibliographic Details
Main authors:	Jha, Tony; Mendel, Jovinna; Cho, Hyuk; Choudhary, Madhusudan
Format:	Online Article Text
Language:	English
Published:	SAGE Publications 2022
Subjects:	Original Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9397377/ https://www.ncbi.nlm.nih.gov/pubmed/36016866 http://dx.doi.org/10.1177/11779322221118335

_version_	1784772113037524992
author	Jha, Tony; Mendel, Jovinna; Cho, Hyuk; Choudhary, Madhusudan
author_facet	Jha, Tony; Mendel, Jovinna; Cho, Hyuk; Choudhary, Madhusudan
author_sort	Jha, Tony
collection	PubMed
description	Small ribonucleic acid (sRNA) sequences are 50–500 nucleotide long, noncoding RNA (ncRNA) sequences that play an important role in regulating transcription and translation within a bacterial cell. As such, identifying sRNA sequences within an organism’s genome is essential to understand the impact of the RNA molecules on cellular processes. Recently, numerous machine learning models have been applied to predict sRNAs within bacterial genomes. In this study, we considered the sRNA prediction as an imbalanced binary classification problem to distinguish minor positive sRNAs from major negative ones within imbalanced data and then performed a comparative study with six learning algorithms and seven assessment metrics. First, we collected numerical feature groups extracted from known sRNAs previously identified in Salmonella typhimurium LT2 (SLT2) and Escherichia coli K12 (E. coli K12) genomes. Second, as a preliminary study, we characterized the sRNA-size distribution with the conformity test for Benford’s law. Third, we applied six traditional classification algorithms to sRNA features and assessed classification performance with seven metrics, varying positive-to-negative instance ratios, and utilizing stratified 10-fold cross-validation. We revisited important individual features and feature groups and found that classification with combined features performs better than with either an individual feature or a single feature group in terms of Area Under Precision-Recall curve (AUPR). We reconfirmed that AUPR properly measures classification performance on imbalanced data with varying imbalance ratios, which is consistent with previous studies on classification metrics for imbalanced data. Overall, eXtreme Gradient Boosting (XGBoost), even without exploiting optimal hyperparameter values, performed better than the other five algorithms with specific optimal parameter settings. As future work, we plan to extend XGBoost further to a large number of published sRNAs in bacterial genomes and compare its classification performance with recent machine learning models’ performance.
format	Online Article Text
id	pubmed-9397377
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	SAGE Publications
record_format	MEDLINE/PubMed
spelling	pubmed-9397377 2022-08-24 Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning Jha, Tony; Mendel, Jovinna; Cho, Hyuk; Choudhary, Madhusudan Bioinform Biol Insights Original Research Article SAGE Publications 2022-08-18 /pmc/articles/PMC9397377/ /pubmed/36016866 http://dx.doi.org/10.1177/11779322221118335 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by-nc/4.0/ This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).
spellingShingle	Original Research Article Jha, Tony Mendel, Jovinna Cho, Hyuk Choudhary, Madhusudan Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning
title	Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning
title_sort	prediction of bacterial srnas using sequence-derived features and machine learning
topic	Original Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9397377/ https://www.ncbi.nlm.nih.gov/pubmed/36016866 http://dx.doi.org/10.1177/11779322221118335

Similar Items