Cargando…

Comparative study of the effectiveness and limitations of current methods for detecting sequence coevolution

Motivation: With rapid accumulation of sequence data on several species, extracting rational and systematic information from multiple sequence alignments (MSAs) is becoming increasingly important. Currently, there is a plethora of computational methods for investigating coupled evolutionary changes...

Descripción completa

Detalles Bibliográficos
Autores principales: Mao, Wenzhi, Kaya, Cihan, Dutta, Anindita, Horovitz, Amnon, Bahar, Ivet
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2015
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4481699/
https://www.ncbi.nlm.nih.gov/pubmed/25697822
http://dx.doi.org/10.1093/bioinformatics/btv103
Descripción
Sumario:Motivation: With rapid accumulation of sequence data on several species, extracting rational and systematic information from multiple sequence alignments (MSAs) is becoming increasingly important. Currently, there is a plethora of computational methods for investigating coupled evolutionary changes in pairs of positions along the amino acid sequence, and making inferences on structure and function. Yet, the significance of coevolution signals remains to be established. Also, a large number of false positives (FPs) arise from insufficient MSA size, phylogenetic background and indirect couplings. Results: Here, a set of 16 pairs of non-interacting proteins is thoroughly examined to assess the effectiveness and limitations of different methods. The analysis shows that recent computationally expensive methods designed to remove biases from indirect couplings outperform others in detecting tertiary structural contacts as well as eliminating intermolecular FPs; whereas traditional methods such as mutual information benefit from refinements such as shuffling, while being highly efficient. Computations repeated with 2,330 pairs of protein families from the Negatome database corroborated these results. Finally, using a training dataset of 162 families of proteins, we propose a combined method that outperforms existing individual methods. Overall, the study provides simple guidelines towards the choice of suitable methods and strategies based on available MSA size and computing resources. Availability and implementation: Software is freely available through the Evol component of ProDy API. Contact: bahar@pitt.edu Supplementary information: Supplementary data are available at Bioinformatics online.