Synthetic protein alignments by CCMgen quantify noise in residue-residue contact prediction

Bibliographic Details
Main Authors: Vorberg, Susann, Seemayer, Stefan, Söding, Johannes
Format: Online Article Text
Language: English
Published: Public Library of Science 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6237422/
https://www.ncbi.nlm.nih.gov/pubmed/30395601
http://dx.doi.org/10.1371/journal.pcbi.1006526
_version_ 1783371192431804416
author Vorberg, Susann
Seemayer, Stefan
Söding, Johannes
collection PubMed
description Compensatory mutations between protein residues in physical contact can manifest themselves as statistical couplings between the corresponding columns in a multiple sequence alignment (MSA) of the protein family. Conversely, large coupling coefficients predict residue contacts. Methods for de-novo protein structure prediction based on this approach are becoming increasingly reliable. Their main limitation is the strong systematic and statistical noise in the estimation of coupling coefficients, which has so far limited their application to very large protein families. While most research has focused on improving predictions by adding external information, little progress has been made to improve the statistical procedure at the core, because our lack of understanding of the sources of noise poses a major obstacle. First, we show theoretically that the expectation value of the coupling score assuming no coupling is proportional to the product of the square roots of the column entropies, and we propose a simple entropy bias correction (EntC) that subtracts out this expectation value. Second, we show that the average product correction (APC) includes the correction of the entropy bias, partly explaining its success. Third, we have developed CCMgen, the first method for simulating protein evolution and generating realistic synthetic MSAs with pairwise statistical residue couplings. Fourth, to learn exact statistical models that reliably reproduce observed alignment statistics, we developed CCMpredPy, an implementation of the persistent contrastive divergence (PCD) method for exact inference. Fifth, we demonstrate how CCMgen and CCMpredPy can facilitate the development of contact prediction methods by analysing the systematic noise contributions from phylogeny and entropy. Using the entropy bias correction, we can disentangle both sources of noise and find that entropy contributes roughly twice as much noise as phylogeny.
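The two corrections named in the description can be illustrated with a minimal sketch. It assumes a symmetric matrix of raw coupling scores held as nested lists, uses the commonly stated form of APC (row mean times column mean divided by the overall mean, with means taken over off-diagonal entries), and treats the proportionality constant `alpha` of the entropy correction as a free parameter, because this record states only the proportionality to the square roots of the column entropies. The function names are hypothetical and are not taken from CCMgen or CCMpredPy.

```python
import math


def column_entropy(column):
    """Shannon entropy (in nats) of one alignment column given as a string."""
    counts = {}
    for symbol in column:
        counts[symbol] = counts.get(symbol, 0) + 1
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def apc(scores):
    """Average product correction: S_ij - mean_i * mean_j / overall mean.

    Means are taken over off-diagonal entries (an assumed convention).
    """
    size = len(scores)
    row_mean = [
        sum(scores[i][j] for j in range(size) if j != i) / (size - 1)
        for i in range(size)
    ]
    overall_mean = sum(row_mean) / size
    return [
        [
            0.0 if i == j else scores[i][j] - row_mean[i] * row_mean[j] / overall_mean
            for j in range(size)
        ]
        for i in range(size)
    ]


def entropy_correction(scores, entropies, alpha):
    """Subtract alpha * sqrt(H_i) * sqrt(H_j) from each off-diagonal score.

    alpha is an assumed free constant; the record does not give its value.
    """
    size = len(scores)
    return [
        [
            0.0 if i == j
            else scores[i][j] - alpha * math.sqrt(entropies[i]) * math.sqrt(entropies[j])
            for j in range(size)
        ]
        for i in range(size)
    ]
```

In this sketch, a score matrix that consists purely of the entropy term is reduced to zero by `entropy_correction`, which is the sense in which the correction "subtracts out" the expected score under no coupling.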
format Online
Article
Text
id pubmed-6237422
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-6237422 2018-11-30 Synthetic protein alignments by CCMgen quantify noise in residue-residue contact prediction Vorberg, Susann Seemayer, Stefan Söding, Johannes PLoS Comput Biol Research Article Public Library of Science 2018-11-05 /pmc/articles/PMC6237422/ /pubmed/30395601 http://dx.doi.org/10.1371/journal.pcbi.1006526 Text en © 2018 Vorberg et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Synthetic protein alignments by CCMgen quantify noise in residue-residue contact prediction
topic Research Article