Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins
Deep learning-based prediction of protein structure usually begins by constructing a multiple sequence alignment (MSA) containing homologs of the target protein. The most successful approaches combine large feature sets derived from MSAs, and considerable computational effort is spent deriving these...
Main authors: | Kandathil, Shaun M.; Greener, Joe G.; Lau, Andy M.; Jones, David T. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | National Academy of Sciences 2022 |
Subjects: | Biological Sciences |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8795500/ https://www.ncbi.nlm.nih.gov/pubmed/35074909 http://dx.doi.org/10.1073/pnas.2113348119 |
_version_ | 1784641079369269248 |
---|---|
author | Kandathil, Shaun M. Greener, Joe G. Lau, Andy M. Jones, David T. |
author_facet | Kandathil, Shaun M. Greener, Joe G. Lau, Andy M. Jones, David T. |
author_sort | Kandathil, Shaun M. |
collection | PubMed |
description | Deep learning-based prediction of protein structure usually begins by constructing a multiple sequence alignment (MSA) containing homologs of the target protein. The most successful approaches combine large feature sets derived from MSAs, and considerable computational effort is spent deriving these input features. We present a method that greatly reduces the amount of preprocessing required for a target MSA, while producing main chain coordinates as a direct output of a deep neural network. The network makes use of just three recurrent networks and a stack of residual convolutional layers, making the predictor very fast to run, and easy to install and use. Our approach constructs a directly learned representation of the sequences in an MSA, starting from a one-hot encoding of the sequences. When supplemented with an approximate precision matrix, the learned representation can be used to produce structural models of comparable or greater accuracy as compared to our original DMPfold method, while requiring less than a second to produce a typical model. This level of accuracy and speed allows very large-scale three-dimensional modeling of proteins on minimal hardware, and we demonstrate this by producing models for over 1.3 million uncharacterized regions of proteins extracted from the BFD sequence clusters. After constructing an initial set of approximate models, we select a confident subset of over 30,000 models for further refinement and analysis, revealing putative novel protein folds. We also provide updated models for over 5,000 Pfam families studied in the original DMPfold paper. |
format | Online Article Text |
id | pubmed-8795500 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | National Academy of Sciences |
record_format | MEDLINE/PubMed |
spelling | pubmed-87955002022-07-24 Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins Kandathil, Shaun M. Greener, Joe G. Lau, Andy M. Jones, David T. Proc Natl Acad Sci U S A Biological Sciences Deep learning-based prediction of protein structure usually begins by constructing a multiple sequence alignment (MSA) containing homologs of the target protein. The most successful approaches combine large feature sets derived from MSAs, and considerable computational effort is spent deriving these input features. We present a method that greatly reduces the amount of preprocessing required for a target MSA, while producing main chain coordinates as a direct output of a deep neural network. The network makes use of just three recurrent networks and a stack of residual convolutional layers, making the predictor very fast to run, and easy to install and use. Our approach constructs a directly learned representation of the sequences in an MSA, starting from a one-hot encoding of the sequences. When supplemented with an approximate precision matrix, the learned representation can be used to produce structural models of comparable or greater accuracy as compared to our original DMPfold method, while requiring less than a second to produce a typical model. This level of accuracy and speed allows very large-scale three-dimensional modeling of proteins on minimal hardware, and we demonstrate this by producing models for over 1.3 million uncharacterized regions of proteins extracted from the BFD sequence clusters. After constructing an initial set of approximate models, we select a confident subset of over 30,000 models for further refinement and analysis, revealing putative novel protein folds. We also provide updated models for over 5,000 Pfam families studied in the original DMPfold paper. 
National Academy of Sciences 2022-01-24 2022-01-25 /pmc/articles/PMC8795500/ /pubmed/35074909 http://dx.doi.org/10.1073/pnas.2113348119 Text en Copyright © 2022 the Author(s). Published by PNAS. https://creativecommons.org/licenses/by-nc-nd/4.0/ This article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | Biological Sciences Kandathil, Shaun M. Greener, Joe G. Lau, Andy M. Jones, David T. Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins |
title | Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins |
title_full | Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins |
title_fullStr | Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins |
title_full_unstemmed | Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins |
title_short | Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins |
title_sort | ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins |
topic | Biological Sciences |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8795500/ https://www.ncbi.nlm.nih.gov/pubmed/35074909 http://dx.doi.org/10.1073/pnas.2113348119 |
work_keys_str_mv | AT kandathilshaunm ultrafastendtoendproteinstructurepredictionenableshighthroughputexplorationofuncharacterizedproteins AT greenerjoeg ultrafastendtoendproteinstructurepredictionenableshighthroughputexplorationofuncharacterizedproteins AT lauandym ultrafastendtoendproteinstructurepredictionenableshighthroughputexplorationofuncharacterizedproteins AT jonesdavidt ultrafastendtoendproteinstructurepredictionenableshighthroughputexplorationofuncharacterizedproteins |
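
The description field in this record names two inputs to the network: a one-hot encoding of the sequences in the MSA, and an approximate precision matrix. The Python sketch below illustrates what such inputs could look like. It is not the authors' implementation. The 21-symbol alphabet, the function names `one_hot_msa` and `approx_precision`, the handling of unknown residues, and the shrinkage-toward-identity estimator are all assumptions chosen for illustration; the record does not say how the paper computes its approximation.

```python
import numpy as np

# Assumed alphabet: 20 standard amino acids plus a gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_msa(msa):
    """Encode aligned sequences as an array of shape (n_seqs, length, 21)."""
    n_seqs, length = len(msa), len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all sequences in an MSA must have the same length")
    encoded = np.zeros((n_seqs, length, len(ALPHABET)), dtype=np.float64)
    for s, seq in enumerate(msa):
        for pos, ch in enumerate(seq.upper()):
            # Assumption: unknown symbols (e.g. X) are treated as gaps.
            encoded[s, pos, INDEX.get(ch, INDEX["-"])] = 1.0
    return encoded


def approx_precision(encoded, shrinkage=0.1, eps=1e-6):
    """Invert a shrunk covariance of the flattened one-hot features.

    Shrinking toward a scaled identity makes the matrix invertible even
    when there are fewer sequences than feature columns. This is a generic
    estimator, used here only as a stand-in.
    """
    n_seqs, length, n_sym = encoded.shape
    flat = encoded.reshape(n_seqs, length * n_sym)
    centered = flat - flat.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / n_seqs
    dim = cov.shape[0]
    target = np.eye(dim) * (np.trace(cov) / dim + eps)
    shrunk = (1.0 - shrinkage) * cov + shrinkage * target
    return np.linalg.inv(shrunk)


if __name__ == "__main__":
    toy_msa = ["ACD-", "ACE-", "GCDA"]  # hypothetical toy alignment
    features = one_hot_msa(toy_msa)
    precision = approx_precision(features)
    print(features.shape, precision.shape)
```

For an alignment of length L, the precision matrix here has shape (21L, 21L), so each pair of alignment columns corresponds to a 21 × 21 block. Dense inversion scales cubically with 21L, which is one reason a cheaper approximation is attractive at large scale.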