
Asymmetric trichotomous partitioning overcomes dataset limitations in building machine learning models for predicting siRNA efficacy

Chemically modified small interfering RNAs (siRNAs) are promising therapeutics guiding sequence-specific silencing of disease genes. Identifying chemically modified siRNA sequences that effectively silence target genes remains challenging. Such determinations necessitate computational algorithms. Ma...


Bibliographic Details
Main authors: Monopoli, Kathryn R., Korkin, Dmitry, Khvorova, Anastasia
Format: Online Article Text
Language: English
Published: American Society of Gene & Cell Therapy 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10338369/
https://www.ncbi.nlm.nih.gov/pubmed/37456778
http://dx.doi.org/10.1016/j.omtn.2023.06.010
author Monopoli, Kathryn R.
Korkin, Dmitry
Khvorova, Anastasia
collection PubMed
description Chemically modified small interfering RNAs (siRNAs) are promising therapeutics guiding sequence-specific silencing of disease genes. Identifying chemically modified siRNA sequences that effectively silence target genes remains challenging. Such determinations necessitate computational algorithms. Machine learning is a powerful predictive approach for tackling biological problems but typically requires datasets significantly larger than most available siRNA datasets. Here, we describe a framework applying machine learning to a small dataset (356 modified sequences) for siRNA efficacy prediction. To overcome noise and biological limitations in siRNA datasets, we apply a trichotomous, two-threshold, partitioning approach, producing several combinations of classification threshold pairs. We then test the effects of different thresholds on random forest machine learning model performance using a novel evaluation metric accounting for class imbalances. We identify thresholds yielding a model with high predictive power, outperforming a linear model generated from the same data, that was predictive upon experimental evaluation. Using a novel model feature extraction method, we observe target site base importances and base preferences consistent with our current understanding of the siRNA-mediated silencing mechanism, with the random forest providing higher resolution than the linear model. This framework applies to any classification challenge involving small biological datasets, providing an opportunity to develop high-performing design algorithms for oligonucleotide therapies.
format Online
Article
Text
id pubmed-10338369
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Society of Gene & Cell Therapy
record_format MEDLINE/PubMed
spelling pubmed-10338369 2023-07-14 Mol Ther Nucleic Acids Original Article
American Society of Gene & Cell Therapy 2023-06-14 /pmc/articles/PMC10338369/ /pubmed/37456778 http://dx.doi.org/10.1016/j.omtn.2023.06.010 Text en © 2023 The Authors. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Asymmetric trichotomous partitioning overcomes dataset limitations in building machine learning models for predicting siRNA efficacy
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10338369/
https://www.ncbi.nlm.nih.gov/pubmed/37456778
http://dx.doi.org/10.1016/j.omtn.2023.06.010