
Caveats to Deep Learning Approaches to RNA Secondary Structure Prediction

Machine learning (ML), and in particular deep learning techniques, have gained popularity for predicting structures from biopolymer sequences. An interesting case is the prediction of RNA secondary structures, where well-established biophysics-based methods exist. The accuracy of these classical metho...


Bibliographic Details
Main Authors: Flamm, Christoph; Wielach, Julia; Wolfinger, Michael T.; Badelt, Stefan; Lorenz, Ronny; Hofacker, Ivo L.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Bioinformatics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9580944/
https://www.ncbi.nlm.nih.gov/pubmed/36304289
http://dx.doi.org/10.3389/fbinf.2022.835422
author Flamm, Christoph
Wielach, Julia
Wolfinger, Michael T.
Badelt, Stefan
Lorenz, Ronny
Hofacker, Ivo L.
collection PubMed
description Machine learning (ML), and in particular deep learning techniques, have gained popularity for predicting structures from biopolymer sequences. An interesting case is the prediction of RNA secondary structures, where well-established biophysics-based methods exist. The accuracy of these classical methods is limited by a lack of experimental parameters and by certain simplifying assumptions, and has seen little improvement over the last decade. This makes RNA folding an attractive target for machine learning, and consequently several deep learning models have been proposed in recent years. However, for ML approaches to be competitive for de novo structure prediction, the models must not just demonstrate good phenomenological fits, but be able to learn a (complex) biophysical model. In this contribution we discuss limitations of current approaches, in particular those due to biases in the training data. Furthermore, we propose to study the capabilities and limitations of ML models by first applying them to synthetic data (obtained from a simplified biophysical model) that can be generated in arbitrary amounts and where all biases can be controlled. We assume that a deep learning model that performs well on these synthetic data would also perform well on real data, and vice versa. We apply this idea by testing several ML models of varying complexity. Finally, we show that the best models are capable of capturing many, but not all, properties of RNA secondary structures. Most severely, the number of predicted base pairs scales quadratically with sequence length, even though a secondary structure can only accommodate a linear number of pairs.
format Online
Article
Text
id pubmed-9580944
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9580944 2022-10-26 Caveats to Deep Learning Approaches to RNA Secondary Structure Prediction Flamm, Christoph; Wielach, Julia; Wolfinger, Michael T.; Badelt, Stefan; Lorenz, Ronny; Hofacker, Ivo L. Front Bioinform Bioinformatics Frontiers Media S.A. 2022-07-11 /pmc/articles/PMC9580944/ /pubmed/36304289 http://dx.doi.org/10.3389/fbinf.2022.835422 Text en Copyright © 2022 Flamm, Wielach, Wolfinger, Badelt, Lorenz and Hofacker. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Caveats to Deep Learning Approaches to RNA Secondary Structure Prediction
topic Bioinformatics