'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve
BACKGROUND: The Z-curve is a three-dimensional representation of DNA sequences proposed over a decade ago and has been extensively applied to sequence segmentation, horizontal gene transfer detection, and sequence analysis. Based on the Z-curve, a "genome order index" was proposed, which...
Main authors: | Elhaik, Eran; Graur, Dan; Josić, Krešimir |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2010 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2841071/ https://www.ncbi.nlm.nih.gov/pubmed/20158921 http://dx.doi.org/10.1186/1745-6150-5-10 |
_version_ | 1782179062749331456 |
---|---|
author | Elhaik, Eran Graur, Dan Josić, Krešimir |
author_facet | Elhaik, Eran Graur, Dan Josić, Krešimir |
author_sort | Elhaik, Eran |
collection | PubMed |
description | BACKGROUND: The Z-curve is a three-dimensional representation of DNA sequences proposed over a decade ago and has been extensively applied to sequence segmentation, horizontal gene transfer detection, and sequence analysis. Based on the Z-curve, a "genome order index" was proposed, which is defined as $S = a^2 + c^2 + t^2 + g^2$, where a, c, t, and g are the nucleotide frequencies of A, C, T, and G, respectively. This index was found to be smaller than 1/3 for almost all tested genomes, which was taken as support for the existence of a constraint on genome composition. A geometric explanation for this constraint has been suggested. Each genome was represented by a point P whose distance from the four faces of a regular tetrahedron was given by the frequencies a, c, t, and g. It was claimed that an inscribed sphere of radius $r = 1/\sqrt{3}$ contains almost all points corresponding to various genomes, implying that $S < r^2$. The distribution of the points P obtained by S was studied using the Z-curve. RESULTS: In this work, we studied the basic properties of the Z-curve using the "genome order index" as a case study. We show that (1) the calculation of the radius of the inscribed sphere of a regular tetrahedron is incorrect, (2) the S index is narrowly distributed, (3) based on the second parity rule, the S index can be derived directly from the Shannon entropy and is, therefore, redundant, and (4) the Z-curve suffers from over-dimensionality, and the dimension that stands for GC content alone suffices to represent any given genome. CONCLUSION: The "genome order index" S does not represent a constraint on nucleotide composition. Moreover, S can be easily computed from the Gini-Simpson index and directly derived from entropy, and is therefore redundant. Overall, the Z-curve and S are over-complicated measures of GC content and the Shannon H index, respectively.
REVIEWERS: This article was reviewed by Claus Wilke, Joel Bader, Marek Kimmel and Uladzislau Hryshkevich (nominated by Itai Yanai). |
format | Text |
id | pubmed-2841071 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2010 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2841071 2010-03-18 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve Elhaik, Eran Graur, Dan Josić, Krešimir Biol Direct Research BioMed Central 2010-02-17 /pmc/articles/PMC2841071/ /pubmed/20158921 http://dx.doi.org/10.1186/1745-6150-5-10 Text en Copyright ©2010 Elhaik et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Elhaik, Eran Graur, Dan Josić, Krešimir 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve |
title | 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve |
title_full | 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve |
title_fullStr | 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve |
title_full_unstemmed | 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve |
title_short | 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve |
title_sort | 'genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the z-curve |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2841071/ https://www.ncbi.nlm.nih.gov/pubmed/20158921 http://dx.doi.org/10.1186/1745-6150-5-10 |
work_keys_str_mv | AT elhaikeran genomeorderindexshouldnotbeusedfordefiningcompositionalconstraintsinnucleotidesequencesacasestudyofthezcurve AT graurdan genomeorderindexshouldnotbeusedfordefiningcompositionalconstraintsinnucleotidesequencesacasestudyofthezcurve AT josickresimir genomeorderindexshouldnotbeusedfordefiningcompositionalconstraintsinnucleotidesequencesacasestudyofthezcurve |
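
Illustrative note (not part of the catalog record): the quantities named in the description can be computed in a few lines. The sketch below is a minimal illustration, not the authors' code. The function names and the example sequence are hypothetical, and `s_from_gc` assumes the second parity rule holds exactly (a = t and c = g), in which case S depends on GC content alone.

```python
from collections import Counter
from math import log2


def nucleotide_frequencies(seq):
    """Return relative frequencies of A, C, G, T, ignoring other symbols."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return {base: counts[base] / total for base in "ACGT"}


def genome_order_index(freqs):
    """S = a^2 + c^2 + t^2 + g^2, as defined in the description."""
    return sum(f * f for f in freqs.values())


def gini_simpson(freqs):
    """Gini-Simpson index, which equals 1 - S."""
    return 1.0 - genome_order_index(freqs)


def shannon_entropy(freqs):
    """Shannon entropy H in bits; zero-frequency terms contribute nothing."""
    return -sum(f * log2(f) for f in freqs.values() if f > 0)


def s_from_gc(gc):
    """S as a function of GC content, ASSUMING a = t and c = g exactly."""
    return ((1.0 - gc) ** 2 + gc ** 2) / 2.0


# Hypothetical toy sequence: A and T occur 3 times each, G and C twice each.
freqs = nucleotide_frequencies("ATGCGCATTA")
s_value = genome_order_index(freqs)
gc_content = freqs["G"] + freqs["C"]
print(s_value, gini_simpson(freqs), shannon_entropy(freqs), s_from_gc(gc_content))
```

Under the same assumption, `s_from_gc` ranges from 1/4 at 50% GC to 1/2 at 0% or 100% GC, so S < 1/3 holds whenever GC content lies between roughly 21% and 79%.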