Cargando…
Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments
Residues in proteins that are in close spatial proximity are more prone to covariate as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein...
Autores principales: | , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
MDPI
2018
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6337344/ https://www.ncbi.nlm.nih.gov/pubmed/30597916 http://dx.doi.org/10.3390/molecules24010104 |
_version_ | 1783388232537341952 |
---|---|
author | Koehl, Patrice Orland, Henri Delarue, Marc |
author_facet | Koehl, Patrice Orland, Henri Delarue, Marc |
author_sort | Koehl, Patrice |
collection | PubMed |
description | Residues in proteins that are in close spatial proximity are more prone to covariate as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple twenty dimensional encoding with each amino acid represented with a vector of one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is less than the performance obtained with the 20-dimensional binary encoding. We highlight also the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment. |
format | Online Article Text |
id | pubmed-6337344 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-63373442019-01-25 Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments Koehl, Patrice Orland, Henri Delarue, Marc Molecules Article Residues in proteins that are in close spatial proximity are more prone to covariate as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple twenty dimensional encoding with each amino acid represented with a vector of one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is less than the performance obtained with the 20-dimensional binary encoding. We highlight also the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment. MDPI 2018-12-28 /pmc/articles/PMC6337344/ /pubmed/30597916 http://dx.doi.org/10.3390/molecules24010104 Text en © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Koehl, Patrice Orland, Henri Delarue, Marc Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments |
title | Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments |
title_full | Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments |
title_fullStr | Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments |
title_full_unstemmed | Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments |
title_short | Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments |
title_sort | numerical encodings of amino acids in multivariate gaussian modeling of protein multiple sequence alignments |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6337344/ https://www.ncbi.nlm.nih.gov/pubmed/30597916 http://dx.doi.org/10.3390/molecules24010104 |
work_keys_str_mv | AT koehlpatrice numericalencodingsofaminoacidsinmultivariategaussianmodelingofproteinmultiplesequencealignments AT orlandhenri numericalencodingsofaminoacidsinmultivariategaussianmodelingofproteinmultiplesequencealignments AT delaruemarc numericalencodingsofaminoacidsinmultivariategaussianmodelingofproteinmultiplesequencealignments |