Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments

Residues in proteins that are in close spatial proximity are more prone to covary, as their interactions are likely to be preserved by structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein...

Bibliographic Details
Main authors: Koehl, Patrice; Orland, Henri; Delarue, Marc
Format: Online Article Text
Language: English
Published: MDPI 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6337344/
https://www.ncbi.nlm.nih.gov/pubmed/30597916
http://dx.doi.org/10.3390/molecules24010104
author Koehl, Patrice
Orland, Henri
Delarue, Marc
collection PubMed
description Residues in proteins that are in close spatial proximity are more prone to covary, as their interactions are likely to be preserved by structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher-dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple twenty-dimensional encoding in which each amino acid is represented by a vector with one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is lower than that obtained with the 20-dimensional binary encoding. We also highlight the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.
format Online
Article
Text
id pubmed-6337344
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-6337344 2019-01-25 Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments Koehl, Patrice Orland, Henri Delarue, Marc Molecules Article MDPI 2018-12-28 /pmc/articles/PMC6337344/ /pubmed/30597916 http://dx.doi.org/10.3390/molecules24010104 Text en © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments
topic Article
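
The abstract in this record describes a twenty-dimensional binary encoding of amino acids and a multivariate Gaussian model built with a prior. The Python sketch below illustrates that general idea only. It is not the authors' implementation: this record does not specify their prior or their scoring function, so the shrinkage prior (a scaled identity matrix), the `shrinkage` parameter, the treatment of gaps as all-zero vectors, and the Frobenius-norm coupling score are all assumptions made for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(msa):
    """Encode an alignment (list of equal-length strings) as an (N, 20*L) array.

    Each residue becomes a 20-vector with a one along its own dimension.
    Gaps and unknown letters are left as all-zero vectors (an assumption).
    """
    n_seq, length = len(msa), len(msa[0])
    x = np.zeros((n_seq, length, 20))
    for s, seq in enumerate(msa):
        for i, aa in enumerate(seq):
            k = AA_INDEX.get(aa)
            if k is not None:
                x[s, i, k] = 1.0
    return x.reshape(n_seq, length * 20)


def coupling_scores(msa, shrinkage=0.5):
    """Return an (L, L) matrix of pairwise coupling scores between columns.

    Illustrative pipeline: empirical covariance -> blend with a simple
    prior -> invert to a precision matrix -> Frobenius norm of each
    20x20 block. The prior and the score are hypothetical choices.
    """
    x = one_hot_encode(msa)
    length = len(msa[0])
    cov = np.cov(x, rowvar=False, bias=True)
    # Hypothetical prior: identity scaled by the mean empirical variance.
    prior = np.eye(cov.shape[0]) * np.mean(np.diag(cov))
    cov_reg = (1.0 - shrinkage) * cov + shrinkage * prior
    precision = np.linalg.inv(cov_reg)
    # Index i*20 + a maps to (column i, amino acid a).
    blocks = precision.reshape(length, 20, length, 20)
    scores = np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    np.fill_diagonal(scores, 0.0)
    return scores


if __name__ == "__main__":
    # Toy alignment: columns 0 and 1 always change together,
    # column 2 varies independently of both.
    toy_msa = ["AAC", "AAD", "GGC", "GGD"]
    print(np.round(coupling_scores(toy_msa), 3))
```

In the toy alignment, columns 0 and 1 covary perfectly while column 2 is independent of them, so the score for the pair (0, 1) should come out larger than the scores involving column 2. Without the prior term the empirical covariance of a one-hot encoded alignment is singular and cannot be inverted, which gives one concrete reason the choice of prior matters in this kind of model.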