
'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve

BACKGROUND: The Z-curve is a three-dimensional representation of DNA sequences that was proposed over a decade ago and has been extensively applied to sequence segmentation, horizontal gene transfer detection, and sequence analysis. Based on the Z-curve, a "genome order index" was proposed, which...


Bibliographic Details
Main authors:	Elhaik, Eran; Graur, Dan; Josić, Krešimir
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2841071/ https://www.ncbi.nlm.nih.gov/pubmed/20158921 http://dx.doi.org/10.1186/1745-6150-5-10

author	Elhaik, Eran Graur, Dan Josić, Krešimir
collection	PubMed
description	BACKGROUND: The Z-curve is a three-dimensional representation of DNA sequences that was proposed over a decade ago and has been extensively applied to sequence segmentation, horizontal gene transfer detection, and sequence analysis. Based on the Z-curve, a "genome order index" was proposed, defined as $S = a^2 + c^2 + t^2 + g^2$, where a, c, t, and g are the frequencies of the nucleotides A, C, T, and G, respectively. This index was found to be smaller than 1/3 for almost all tested genomes, which was taken as support for the existence of a constraint on genome composition. A geometric explanation for this constraint has been suggested: each genome was represented by a point P whose distances from the four faces of a regular tetrahedron were given by the frequencies a, c, t, and g. The proponents of the index claimed that an inscribed sphere of radius $r = 1/\sqrt{3}$ contains almost all points corresponding to various genomes, implying that $S < r^2$. The distribution of the points P obtained by S was studied using the Z-curve. RESULTS: In this work, we studied the basic properties of the Z-curve using the "genome order index" as a case study. We show that (1) the calculation of the radius of the inscribed sphere of a regular tetrahedron is incorrect, (2) the S index is narrowly distributed, (3) based on the second parity rule, the S index can be derived directly from the Shannon entropy and is, therefore, redundant, and (4) the Z-curve suffers from over-dimensionality, and the dimension that stands for GC content alone suffices to represent any given genome. CONCLUSION: The "genome order index" S does not represent a constraint on nucleotide composition. Moreover, S can be easily computed from the Gini-Simpson index and derived directly from entropy, and is therefore redundant. Overall, the Z-curve and S are over-complicated measures of GC content and the Shannon H index, respectively.
REVIEWERS: This article was reviewed by Claus Wilke, Joel Bader, Marek Kimmel and Uladzislau Hryshkevich (nominated by Itai Yanai).
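
The quantities named in the abstract can be illustrated with a minimal sketch. This is not code from the article: the function names and the toy sequence are hypothetical, and the helper `order_index_from_gc` assumes the second parity rule in its simplest form (a = t and c = g), under which S depends on GC content alone. The sketch also shows that S is the complement of the Gini-Simpson index (1 - S).

```python
from collections import Counter
from math import log2


def nucleotide_frequencies(seq):
    """Relative frequencies of A, C, G, T in seq (other symbols are ignored)."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return {base: counts[base] / total for base in "ACGT"}


def order_index(freqs):
    """Genome order index S = a^2 + c^2 + t^2 + g^2."""
    return sum(f * f for f in freqs.values())


def gini_simpson(freqs):
    """Gini-Simpson index, which equals 1 - S."""
    return 1.0 - order_index(freqs)


def shannon_entropy(freqs):
    """Shannon entropy H in bits; zero-frequency bases contribute nothing."""
    return -sum(f * log2(f) for f in freqs.values() if f > 0)


def order_index_from_gc(gc):
    """S as a function of GC content, ASSUMING a = t and c = g."""
    at_each = (1.0 - gc) / 2.0  # frequency of A and of T
    gc_each = gc / 2.0          # frequency of C and of G
    return 2 * at_each ** 2 + 2 * gc_each ** 2


if __name__ == "__main__":
    freqs = nucleotide_frequencies("ACGTACGTAATT")  # toy sequence, not real data
    gc = freqs["G"] + freqs["C"]
    print("S            =", order_index(freqs))
    print("Gini-Simpson =", gini_simpson(freqs))
    print("Shannon H    =", shannon_entropy(freqs))
    print("S from GC    =", order_index_from_gc(gc))
```

For the toy sequence, which happens to satisfy a = t and c = g, `order_index` and `order_index_from_gc` agree, which is the sense in which GC content alone determines S under that assumption.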
format	Text
id	pubmed-2841071
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2841071 2010-03-18 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve Elhaik, Eran Graur, Dan Josić, Krešimir Biol Direct Research BioMed Central 2010-02-17 /pmc/articles/PMC2841071/ /pubmed/20158921 http://dx.doi.org/10.1186/1745-6150-5-10 Text en Copyright ©2010 Elhaik et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2841071/ https://www.ncbi.nlm.nih.gov/pubmed/20158921 http://dx.doi.org/10.1186/1745-6150-5-10
