FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets

BACKGROUND: Amino acid replacement rate matrices are a crucial component of many protein analysis systems such as sequence similarity search, sequence alignment, and phylogenetic inference. Ideally, the rate matrix reflects the mutational behavior of the actual data under study; however, estimating...

Bibliographic Details
Main authors:	Dang, Cuong Cao; Le, Vinh Sy; Gascuel, Olivier; Hazes, Bart; Le, Quang Si
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4287512/ https://www.ncbi.nlm.nih.gov/pubmed/25344302 http://dx.doi.org/10.1186/1471-2105-15-341

author	Dang, Cuong Cao Le, Vinh Sy Gascuel, Olivier Hazes, Bart Le, Quang Si
author_sort	Dang, Cuong Cao
collection	PubMed
description	BACKGROUND: Amino acid replacement rate matrices are a crucial component of many protein analysis systems such as sequence similarity search, sequence alignment, and phylogenetic inference. Ideally, the rate matrix reflects the mutational behavior of the actual data under study; however, estimating amino acid replacement rate matrices requires large protein alignments and is computationally expensive and complex. As a compromise, sub-optimal pre-calculated generic matrices are typically used for protein-based phylogeny. Sequence availability has now grown to a point where problem-specific rate matrices can often be calculated if the computational cost can be controlled. RESULTS: The most time consuming step in estimating rate matrices by maximum likelihood is building maximum likelihood phylogenetic trees from protein alignments. We propose a new procedure, called FastMG, to overcome this obstacle. The key innovation is the alignment-splitting algorithm that splits alignments with many sequences into non-overlapping sub-alignments prior to estimating amino acid replacement rates. Experiments with different large data sets showed that the FastMG procedure was an order of magnitude faster than without splitting. Importantly, there was no apparent loss in matrix quality if an appropriate splitting procedure is used. CONCLUSIONS: FastMG is a simple, fast and accurate procedure to estimate amino acid replacement rate matrices from large data sets. It enables researchers to study the evolutionary relationships for specific groups of proteins or taxa with optimized, data-specific amino acid replacement rate matrices. The programs, data sets, and the new mammalian mitochondrial protein rate matrix are available at http://fastmg.codeplex.com.
format	Online Article Text
id	pubmed-4287512
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4287512 2015-01-09 FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets Dang, Cuong Cao Le, Vinh Sy Gascuel, Olivier Hazes, Bart Le, Quang Si BMC Bioinformatics Research Article BioMed Central 2014-10-24 /pmc/articles/PMC4287512/ /pubmed/25344302 http://dx.doi.org/10.1186/1471-2105-15-341 Text en © Dang et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4287512/ https://www.ncbi.nlm.nih.gov/pubmed/25344302 http://dx.doi.org/10.1186/1471-2105-15-341