Data-specific substitution models improve protein-based phylogenetics

Calculating amino-acid substitution models that are specific for individual protein data sets is often difficult due to the computational burden of estimating large numbers of rate parameters. In this study, we tested the computational efficiency and accuracy of five methods used to estimate substit...

Bibliographic Details
Main authors:	Brazão, João M., Foster, Peter G., Cox, Cymon J.
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2023
Subjects:	Bioinformatics
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10416777/ https://www.ncbi.nlm.nih.gov/pubmed/37576497 http://dx.doi.org/10.7717/peerj.15716

author	Brazão, João M. Foster, Peter G. Cox, Cymon J.
collection	PubMed
description	Calculating amino-acid substitution models that are specific for individual protein data sets is often difficult due to the computational burden of estimating large numbers of rate parameters. In this study, we tested the computational efficiency and accuracy of five methods used to estimate substitution models, namely Codeml, FastMG, IQ-TREE, P4 (maximum likelihood), and P4 (Bayesian inference). Data-specific substitution models were estimated from simulated alignments (with different lengths) that were generated from a known simulation model and simulation tree. Each of the resulting data-specific substitution models was used to calculate the maximum likelihood score of the simulation tree and simulated data that was used to calculate the model, and compared with the maximum likelihood scores of the known simulation model and simulation tree on the same simulated data. Additionally, the commonly-used empirical models, cpREV and WAG, were assessed similarly. Data-specific models performed better than the empirical models, which under-fitted the simulated alignments, had the highest difference to the simulation model maximum-likelihood score, clustered further from the simulation model in principal component analysis ordination, and inferred less accurate trees. Data-specific models and the simulation model shared statistically indistinguishable maximum-likelihood scores, indicating that the five methods were reasonably accurate at estimating substitution models by this measure. Nevertheless, tree statistics showed differences between optimal maximum likelihood trees. Unlike other model estimating methods, trees inferred using data-specific models generated with IQ-TREE and P4 (maximum likelihood) were not significantly different from the trees derived from the simulation model in each analysis, indicating that these two methods alone were the most accurate at estimating data-specific models. 
To show the benefits of using data-specific protein models, several published data sets were reanalysed using IQ-TREE-estimated models. These newly estimated models were a better fit to the data than the empirical models that were used by the original authors, often inferred longer trees, and resulted in different tree topologies in more than half of the reanalysed data sets. The results of this study show that software availability and high computational burden are not limitations to generating better-fitting data-specific amino-acid substitution models for phylogenetic analyses.
format	Online Article Text
id	pubmed-10416777
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-10416777 2023-08-12 Data-specific substitution models improve protein-based phylogenetics Brazão, João M. Foster, Peter G. Cox, Cymon J. PeerJ Bioinformatics PeerJ Inc. 2023-08-08 /pmc/articles/PMC10416777/ /pubmed/37576497 http://dx.doi.org/10.7717/peerj.15716 Text en © 2023 Brazão et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title	Data-specific substitution models improve protein-based phylogenetics
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10416777/ https://www.ncbi.nlm.nih.gov/pubmed/37576497 http://dx.doi.org/10.7717/peerj.15716
