Pitfalls of the most commonly used models of context dependent substitution

BACKGROUND: Neighboring nucleotides exert a striking influence on mutation, with the hypermutability of CpG dinucleotides in many genomes being an exemplar. Among the approaches employed to measure the relative importance of sequence neighbors on molecular evolution have been continuous-time Markov...

Bibliographic Details
Main Authors:	Lindsay, Helen; Yap, Von Bing; Ying, Hua; Huttley, Gavin A
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2628887/ https://www.ncbi.nlm.nih.gov/pubmed/19087239 http://dx.doi.org/10.1186/1745-6150-3-52

author	Lindsay, Helen; Yap, Von Bing; Ying, Hua; Huttley, Gavin A
collection	PubMed
description	BACKGROUND: Neighboring nucleotides exert a striking influence on mutation, with the hypermutability of CpG dinucleotides in many genomes being an exemplar. Among the approaches employed to measure the relative importance of sequence neighbors on molecular evolution have been continuous-time Markov process models for substitutions that treat sequences as a series of independent tuples. The most widely used examples are the codon substitution models. We evaluated the suitability of derivatives of the nucleotide frequency weighted (hereafter NF) and tuple frequency weighted (hereafter TF) models for measuring sequence context dependent substitution. Critical properties we address are their relationships to an independent nucleotide process and the robustness of parameter estimation to changes in sequence composition. We then consider the impact on inference concerning dinucleotide substitution processes from application of these two forms to intron sequence alignments from primates. RESULTS: We prove that the NF form always nests the independent nucleotide process and that this is not true for the TF form. As a consequence, using TF to study context effects can be misleading, which is shown by both theoretical calculations and simulations. We describe a simple example where a context parameter estimated under TF is confounded with composition terms unless all sequence states are equi-frequent. We illustrate this for the dinucleotide case by simulation under a nucleotide model, showing that the TF form identifies a CpG effect when none exists. Our analysis of primate introns revealed that the effect of nucleotide neighbors is over-estimated under TF compared with NF. Parameter estimates for a number of contexts are also strikingly discordant between the two model forms. CONCLUSION: Our results establish that the NF form should be used for analysis of independent-tuple context dependent processes. Although neighboring effects in general are still important, prominent influences such as the elevated CpG transversion rate previously identified using the TF form are an artifact. Our results further suggest as few as 5 parameters may account for ~85% of neighboring nucleotide influence. REVIEWERS: This article was reviewed by Dr Rob Knight, Dr Josh Cherry (nominated by Dr David Lipman) and Dr Stephen Altschul (nominated by Dr David Lipman).
format	Text
id	pubmed-2628887
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2628887 2009-01-21 Pitfalls of the most commonly used models of context dependent substitution. Lindsay, Helen; Yap, Von Bing; Ying, Hua; Huttley, Gavin A. Biol Direct. Research. BioMed Central 2008-12-16 /pmc/articles/PMC2628887/ /pubmed/19087239 http://dx.doi.org/10.1186/1745-6150-3-52 Text en Copyright © 2008 Lindsay et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Pitfalls of the most commonly used models of context dependent substitution
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2628887/ https://www.ncbi.nlm.nih.gov/pubmed/19087239 http://dx.doi.org/10.1186/1745-6150-3-52
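
The abstract's central claim, that the NF form nests the independent nucleotide process while the TF form does not, can be illustrated with a small numerical sketch. This record does not give the model formulas, so the definitions below are assumptions: with no context parameters, NF sets the rate of a single-nucleotide change to r[a, b] times the frequency of the incoming nucleotide, and TF sets it to r[a, b] times the frequency of the incoming tuple, here taken to be the product of its nucleotide frequencies. The function names, the HKY-style exchangeabilities and the frequency values are hypothetical choices made only for illustration.

```python
import itertools
import numpy as np

NUCS = "ACGT"
# 16 dinucleotides; index = 4 * first position + second position
TUPLES = ["".join(p) for p in itertools.product(NUCS, repeat=2)]


def hky_exchangeabilities(kappa):
    """Symmetric 4x4 exchangeabilities: kappa for transitions, 1 otherwise."""
    r = np.ones((4, 4))
    for a, b in [(0, 2), (2, 0), (1, 3), (3, 1)]:  # A<->G and C<->T
        r[a, b] = kappa
    return r


def nucleotide_q(pi, r):
    """4x4 generator with q[a, b] = r[a, b] * pi[b]; rows sum to zero."""
    q = r * pi[np.newaxis, :]
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))
    return q


def independent_dinuc_q(q):
    """Two sites evolving independently: Kronecker sum of q with itself."""
    eye = np.eye(4)
    return np.kron(q, eye) + np.kron(eye, q)


def tuple_q(pi, r, form):
    """16x16 dinucleotide generator with no context parameters.

    form="NF" weights by the incoming nucleotide frequency; form="TF"
    weights by the incoming tuple frequency (assumed to be a product of
    nucleotide frequencies in this sketch).
    """
    q = np.zeros((16, 16))
    for i, src in enumerate(TUPLES):
        for j, dst in enumerate(TUPLES):
            diffs = [k for k in range(2) if src[k] != dst[k]]
            if len(diffs) != 1:
                continue  # only single-nucleotide changes get a nonzero rate
            k = diffs[0]
            a, b = NUCS.index(src[k]), NUCS.index(dst[k])
            if form == "NF":
                weight = pi[b]
            else:
                weight = pi[NUCS.index(dst[0])] * pi[NUCS.index(dst[1])]
            q[i, j] = r[a, b] * weight
    np.fill_diagonal(q, -q.sum(axis=1))
    return q


r = hky_exchangeabilities(2.0)
pi_skew = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical A, C, G, T frequencies

q_ind = independent_dinuc_q(nucleotide_q(pi_skew, r))
q_nf = tuple_q(pi_skew, r, "NF")
q_tf = tuple_q(pi_skew, r, "TF")

print("NF equals independent process:", np.allclose(q_nf, q_ind))
print("TF equals independent process:", np.allclose(q_tf, q_ind))

# C->T rate at the first position, next to G versus next to A
cg, tg = TUPLES.index("CG"), TUPLES.index("TG")
ca, ta = TUPLES.index("CA"), TUPLES.index("TA")
ratio_nf = q_nf[cg, tg] / q_nf[ca, ta]
ratio_tf = q_tf[cg, tg] / q_tf[ca, ta]
print("neighbor rate ratio under NF:", ratio_nf)
print("neighbor rate ratio under TF:", ratio_tf)

# With equal frequencies, TF differs from the independent process only by scale
pi_uniform = np.full(4, 0.25)
q_ind_uniform = independent_dinuc_q(nucleotide_q(pi_uniform, r))
q_tf_uniform = tuple_q(pi_uniform, r, "TF")
```

Under these assumptions the NF matrix coincides with the independent two-site process, whereas the TF rate of a change depends on the frequency of the unchanged neighbor. That dependence vanishes only when all nucleotide frequencies are equal, in which case TF matches the independent process up to a constant rescaling of time. This is consistent with the abstract's statement that context parameters estimated under TF are confounded with composition, but it is an illustration and not the authors' proof or simulation.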
