Caveats to Deep Learning Approaches to RNA Secondary Structure Prediction
Machine learning (ML) and in particular deep learning techniques have gained popularity for predicting structures from biopolymer sequences. An interesting case is the prediction of RNA secondary structures, where well established biophysics based methods exist. The accuracy of these classical metho...
Main authors: | Flamm, Christoph; Wielach, Julia; Wolfinger, Michael T.; Badelt, Stefan; Lorenz, Ronny; Hofacker, Ivo L. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2022 |
Subjects: | Bioinformatics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9580944/ https://www.ncbi.nlm.nih.gov/pubmed/36304289 http://dx.doi.org/10.3389/fbinf.2022.835422 |
_version_ | 1784812506353500160 |
---|---|
author | Flamm, Christoph; Wielach, Julia; Wolfinger, Michael T.; Badelt, Stefan; Lorenz, Ronny; Hofacker, Ivo L. |
author_facet | Flamm, Christoph; Wielach, Julia; Wolfinger, Michael T.; Badelt, Stefan; Lorenz, Ronny; Hofacker, Ivo L. |
author_sort | Flamm, Christoph |
collection | PubMed |
description | Machine learning (ML), and in particular deep learning, techniques have gained popularity for predicting structures from biopolymer sequences. An interesting case is the prediction of RNA secondary structures, where well-established, biophysics-based methods exist. The accuracy of these classical methods is limited by a lack of experimental parameters and by certain simplifying assumptions, and has seen little improvement over the last decade. This makes RNA folding an attractive target for machine learning, and consequently several deep learning models have been proposed in recent years. However, for ML approaches to be competitive for de novo structure prediction, the models must not just demonstrate good phenomenological fits, but be able to learn a (complex) biophysical model. In this contribution we discuss limitations of current approaches, in particular those due to biases in the training data. Furthermore, we propose to study the capabilities and limitations of ML models by first applying them to synthetic data (obtained from a simplified biophysical model) that can be generated in arbitrary amounts and where all biases can be controlled. We assume that a deep learning model that performs well on these synthetic data would also perform well on real data, and vice versa. We apply this idea by testing several ML models of varying complexity. Finally, we show that the best models are capable of capturing many, but not all, properties of RNA secondary structures. Most severely, the number of predicted base pairs scales quadratically with sequence length, even though a secondary structure can only accommodate a linear number of pairs. |
format | Online Article Text |
id | pubmed-9580944 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9580944 2022-10-26 Caveats to Deep Learning Approaches to RNA Secondary Structure Prediction. Flamm, Christoph; Wielach, Julia; Wolfinger, Michael T.; Badelt, Stefan; Lorenz, Ronny; Hofacker, Ivo L. Front Bioinform. Bioinformatics. Frontiers Media S.A. 2022-07-11 /pmc/articles/PMC9580944/ /pubmed/36304289 http://dx.doi.org/10.3389/fbinf.2022.835422 Text en Copyright © 2022 Flamm, Wielach, Wolfinger, Badelt, Lorenz and Hofacker. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
title | Caveats to Deep Learning Approaches to RNA Secondary Structure Prediction |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9580944/ https://www.ncbi.nlm.nih.gov/pubmed/36304289 http://dx.doi.org/10.3389/fbinf.2022.835422 |
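
The closing claim of the abstract (a secondary structure can hold only a linear number of base pairs, while the models' predicted pair counts grow quadratically) can be made concrete with a little counting. The sketch below is an illustration only and is not taken from the article: the function names and the minimum hairpin loop size of 3 unpaired nucleotides are assumptions chosen for the example. It compares an upper bound on the pairs a single structure can contain with the number of distinct position pairs a predictor could, in principle, mark as paired.

```python
def max_pairs(n: int, min_loop: int = 3) -> int:
    """Upper bound on base pairs in ONE secondary structure of length n.

    Each nucleotide pairs at most once, and the innermost pair must
    enclose at least `min_loop` unpaired nucleotides (assumed value: 3).
    The bound therefore grows linearly with n.
    """
    return max(0, (n - min_loop) // 2)


def candidate_pairs(n: int, min_loop: int = 3) -> int:
    """Number of position pairs (i, j) with j - i > min_loop.

    This is how many pairs could be emitted if every admissible (i, j)
    were scored independently; it grows quadratically with n.
    """
    m = n - min_loop - 1
    return m * (m + 1) // 2 if m > 0 else 0


if __name__ == "__main__":
    for n in (50, 100, 200, 400):
        print(n, max_pairs(n), candidate_pairs(n))
```

Doubling the sequence length roughly doubles the first number and roughly quadruples the second, which is the mismatch the abstract describes.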