
Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task

Ribosomes are information-processing macromolecular machines that integrate complex sequence patterns in messenger RNA (mRNA) transcripts to synthesize proteins. Studies of the sequence features that distinguish mRNAs from long noncoding RNAs (lncRNAs) may yield insight into the information that directs and regulates translation. Computational methods for calculating protein-coding potential are important for distinguishing mRNAs from lncRNAs during genome annotation, but most machine learning methods for this task rely on previously known rules to define features. Sequence-to-sequence (seq2seq) models, particularly ones using transformer networks, have proven capable of learning complex grammatical relationships between words to perform natural language translation. Seeking to leverage these advancements in the biological domain, we present a seq2seq formulation for predicting protein-coding potential with deep neural networks and demonstrate that simultaneously learning translation from RNA to protein improves classification performance relative to a classification-only training objective. Inspired by classical signal processing methods for gene discovery and Fourier-based image-processing neural networks, we introduce LocalFilterNet (LFNet). LFNet is a network architecture with an inductive bias for modeling the three-nucleotide periodicity apparent in coding sequences. We incorporate LFNet within an encoder-decoder framework to test whether the translation task improves the classification of transcripts and the interpretation of their sequence features. We use the resulting model to compute nucleotide-resolution importance scores, revealing sequence patterns that could assist the cellular machinery in distinguishing mRNAs and lncRNAs. Finally, we develop a novel approach for estimating mutation effects from Integrated Gradients, a backpropagation-based feature attribution, and characterize the difficulty of efficient approximations in this setting.
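The classical signal-processing observation behind LFNet's inductive bias is that protein-coding sequence shows strong three-nucleotide periodicity, which a discrete Fourier transform exposes as a spectral peak at frequency 1/3. As a minimal sketch of that classical heuristic (not the paper's LFNet or the authors' code; `periodicity_score` is a hypothetical helper name), one can measure the fraction of non-DC spectral power landing in the 1/3-frequency bin of per-nucleotide indicator signals:

```python
import numpy as np

def periodicity_score(seq: str, period: int = 3) -> float:
    """Fraction of non-DC spectral power at frequency 1/period,
    summed over the four per-nucleotide indicator channels
    (a classical gene-discovery heuristic, not LFNet itself)."""
    n = len(seq)
    assert n % period == 0, "pad or trim so the 1/period bin is exact"
    total = 0.0
    at_period = 0.0
    for base in "ACGT":
        # Binary indicator signal: 1 where this nucleotide occurs.
        x = np.fromiter((1.0 if c == base else 0.0 for c in seq),
                        dtype=float, count=n)
        power = np.abs(np.fft.fft(x)) ** 2
        total += power[1:].sum()          # exclude the DC component
        at_period += power[n // period]   # bin at frequency 1/period
    return float(at_period / total) if total > 0 else 0.0

coding_like = "ATGGCC" * 30   # codon-like repeat, length 180
flat = "A" * 90 + "C" * 90    # same length, no 3-nt periodicity
```

A repeating codon-like pattern concentrates power in the 1/3 bin, while a periodicity-free sequence scores near zero; LFNet can be read as replacing this fixed spectral filter with learnable local filters under the same periodicity prior.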


Bibliographic Details
Main Authors: Valencia, Joseph D.; Hendrix, David A.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory, 2023
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10104019/
https://www.ncbi.nlm.nih.gov/pubmed/37066250
http://dx.doi.org/10.1101/2023.04.03.535488
Published online: 2023-04-19
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (https://creativecommons.org/licenses/by-nc/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator.