Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins

Deep learning-based prediction of protein structure usually begins by constructing a multiple sequence alignment (MSA) containing homologs of the target protein. The most successful approaches combine large feature sets derived from MSAs, and considerable computational effort is spent deriving these...

Bibliographic Details
Main authors: Kandathil, Shaun M.; Greener, Joe G.; Lau, Andy M.; Jones, David T.
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2022
Subjects: Biological Sciences
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8795500/
https://www.ncbi.nlm.nih.gov/pubmed/35074909
http://dx.doi.org/10.1073/pnas.2113348119
author Kandathil, Shaun M.
Greener, Joe G.
Lau, Andy M.
Jones, David T.
collection PubMed
description Deep learning-based prediction of protein structure usually begins by constructing a multiple sequence alignment (MSA) containing homologs of the target protein. The most successful approaches combine large feature sets derived from MSAs, and considerable computational effort is spent deriving these input features. We present a method that greatly reduces the amount of preprocessing required for a target MSA, while producing main chain coordinates as a direct output of a deep neural network. The network makes use of just three recurrent networks and a stack of residual convolutional layers, making the predictor very fast to run, and easy to install and use. Our approach constructs a directly learned representation of the sequences in an MSA, starting from a one-hot encoding of the sequences. When supplemented with an approximate precision matrix, the learned representation can be used to produce structural models of comparable or greater accuracy as compared to our original DMPfold method, while requiring less than a second to produce a typical model. This level of accuracy and speed allows very large-scale three-dimensional modeling of proteins on minimal hardware, and we demonstrate this by producing models for over 1.3 million uncharacterized regions of proteins extracted from the BFD sequence clusters. After constructing an initial set of approximate models, we select a confident subset of over 30,000 models for further refinement and analysis, revealing putative novel protein folds. We also provide updated models for over 5,000 Pfam families studied in the original DMPfold paper.
format Online
Article
Text
id pubmed-8795500
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-8795500 2022-07-24 Proc Natl Acad Sci U S A Biological Sciences
National Academy of Sciences 2022-01-24 2022-01-25 /pmc/articles/PMC8795500/ /pubmed/35074909 http://dx.doi.org/10.1073/pnas.2113348119 Text en Copyright © 2022 the Author(s). Published by PNAS. https://creativecommons.org/licenses/by-nc-nd/4.0/ This article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins
topic Biological Sciences