Quantification of the variation in percentage identity for protein sequence alignments

BACKGROUND: Percentage Identity (PID) is frequently quoted in discussion of sequence alignments since it appears simple and easy to understand. However, although there are several different ways to calculate percentage identity and each may yield a different result for the same alignment, the method...

Bibliographic Details
Main authors:	Raghava, GPS; Barton, Geoffrey J
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Correspondence
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1592310/ https://www.ncbi.nlm.nih.gov/pubmed/16984632 http://dx.doi.org/10.1186/1471-2105-7-415

_version_	1782130396264136704
author	Raghava, GPS Barton, Geoffrey J
author_facet	Raghava, GPS Barton, Geoffrey J
author_sort	Raghava, GPS
collection	PubMed
description	BACKGROUND: Percentage Identity (PID) is frequently quoted in discussion of sequence alignments since it appears simple and easy to understand. However, although there are several different ways to calculate percentage identity and each may yield a different result for the same alignment, the method of calculation is rarely reported. Accordingly, quantification of the variation in PID caused by the different calculations would help in interpreting PID values in the literature. In this study, the variation in PID was quantified systematically on a reference set of 1028 alignments generated by comparison of the protein three-dimensional structures. Since the alignment algorithm may also affect the range of PID, this study also considered the effect of algorithm, and the combination of algorithm and PID method. RESULTS: The maximum variation in PID due to the calculation method was 11.5% while the effect of alignment algorithm on PID was up to 14.6% across three popular alignment methods. The combined effect of alignment algorithm and PID calculation gave a variation of up to 22% on the test data, with an average of 5.3% ± 2.8% for sequence pairs with < 30% identity. In order to see which PID method was most highly correlated with structural similarity, four different PID calculations were compared to similarity scores (Sc) from the comparison of the corresponding protein three-dimensional structures. The highest correlation coefficient for a PID calculation was 0.80. In contrast, the more sophisticated Z-score calculated by reference to randomized sequences gave a correlation coefficient of 0.84. CONCLUSION: Although it is well known amongst expert sequence analysts that PID is a poor score for discriminating between protein sequences, the apparent simplicity of the percentage identity score encourages its widespread use in establishing cutoffs for structural similarity. This paper illustrates that not only is PID a poor measure of sequence similarity when compared to the Z-score, but that there is also a large uncertainty in reported PID values. Since better alternatives to PID exist to quantify sequence similarity, these should be quoted where possible in preference to PID. The findings presented here should prove helpful to those new to sequence analysis, and in warning those who seek to interpret the value of a PID reported in the literature.
format	Text
id	pubmed-1592310
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1592310 2006-10-12 Quantification of the variation in percentage identity for protein sequence alignments Raghava, GPS Barton, Geoffrey J BMC Bioinformatics Correspondence BACKGROUND: Percentage Identity (PID) is frequently quoted in discussion of sequence alignments since it appears simple and easy to understand. However, although there are several different ways to calculate percentage identity and each may yield a different result for the same alignment, the method of calculation is rarely reported. Accordingly, quantification of the variation in PID caused by the different calculations would help in interpreting PID values in the literature. In this study, the variation in PID was quantified systematically on a reference set of 1028 alignments generated by comparison of the protein three-dimensional structures. Since the alignment algorithm may also affect the range of PID, this study also considered the effect of algorithm, and the combination of algorithm and PID method. RESULTS: The maximum variation in PID due to the calculation method was 11.5% while the effect of alignment algorithm on PID was up to 14.6% across three popular alignment methods. The combined effect of alignment algorithm and PID calculation gave a variation of up to 22% on the test data, with an average of 5.3% ± 2.8% for sequence pairs with < 30% identity. In order to see which PID method was most highly correlated with structural similarity, four different PID calculations were compared to similarity scores (Sc) from the comparison of the corresponding protein three-dimensional structures. The highest correlation coefficient for a PID calculation was 0.80. In contrast, the more sophisticated Z-score calculated by reference to randomized sequences gave a correlation coefficient of 0.84. CONCLUSION: Although it is well known amongst expert sequence analysts that PID is a poor score for discriminating between protein sequences, the apparent simplicity of the percentage identity score encourages its widespread use in establishing cutoffs for structural similarity. This paper illustrates that not only is PID a poor measure of sequence similarity when compared to the Z-score, but that there is also a large uncertainty in reported PID values. Since better alternatives to PID exist to quantify sequence similarity, these should be quoted where possible in preference to PID. The findings presented here should prove helpful to those new to sequence analysis, and in warning those who seek to interpret the value of a PID reported in the literature. BioMed Central 2006-09-19 /pmc/articles/PMC1592310/ /pubmed/16984632 http://dx.doi.org/10.1186/1471-2105-7-415 Text en Copyright © 2006 Raghava and Barton; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Correspondence Raghava, GPS Barton, Geoffrey J Quantification of the variation in percentage identity for protein sequence alignments
title	Quantification of the variation in percentage identity for protein sequence alignments
title_full	Quantification of the variation in percentage identity for protein sequence alignments
title_fullStr	Quantification of the variation in percentage identity for protein sequence alignments
title_full_unstemmed	Quantification of the variation in percentage identity for protein sequence alignments
title_short	Quantification of the variation in percentage identity for protein sequence alignments
title_sort	quantification of the variation in percentage identity for protein sequence alignments
topic	Correspondence
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1592310/ https://www.ncbi.nlm.nih.gov/pubmed/16984632 http://dx.doi.org/10.1186/1471-2105-7-415
work_keys_str_mv	AT raghavagps quantificationofthevariationinpercentageidentityforproteinsequencealignments AT bartongeoffreyj quantificationofthevariationinpercentageidentityforproteinsequencealignments
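
The abstract's central point, that one alignment can yield several different PID values depending on the calculation, can be illustrated with a short sketch. This record does not list the four calculations the authors compared, so the denominators below (aligned residue pairs, aligned pairs plus internal gap columns, shorter sequence length, mean sequence length) are assumptions chosen because they are commonly seen, and the function name `pid_variants`, the dictionary keys, and the toy sequences are hypothetical.

```python
from typing import Dict


def pid_variants(seq_a: str, seq_b: str, gap: str = "-") -> Dict[str, float]:
    """Percentage identity of one pairwise alignment under four denominators.

    The denominators are illustrative assumptions, not taken from the record.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    columns = list(zip(seq_a, seq_b))
    # Indices of columns where both sequences have a residue (no gap).
    paired = [i for i, (x, y) in enumerate(columns) if x != gap and y != gap]
    if not paired:
        raise ValueError("alignment contains no aligned residue pairs")

    # Numerator shared by every variant: identical aligned residue pairs.
    identical = sum(1 for i in paired if columns[i][0] == columns[i][1])

    # Internal gap columns: exactly one sequence is gapped, and the column
    # lies between the first and last aligned residue pair (assumed definition).
    first, last = paired[0], paired[-1]
    internal_gaps = sum(
        1 for x, y in columns[first:last + 1] if (x == gap) != (y == gap)
    )

    len_a = len(seq_a) - seq_a.count(gap)  # ungapped length of sequence A
    len_b = len(seq_b) - seq_b.count(gap)  # ungapped length of sequence B

    denominators = {
        "aligned_pairs": len(paired),
        "aligned_pairs_plus_internal_gaps": len(paired) + internal_gaps,
        "shorter_length": min(len_a, len_b),
        "mean_length": (len_a + len_b) / 2,
    }
    return {name: 100.0 * identical / d for name, d in denominators.items()}


if __name__ == "__main__":
    # Toy alignment (hypothetical data): 6 identities, 7 aligned pairs,
    # 2 internal gap columns, ungapped lengths 10 and 8.
    result = pid_variants("MKT-AYIAKQR", "MKTLAF-AK--")
    for name, value in result.items():
        print(f"{name}: {value:.1f}%")
    print(f"spread: {max(result.values()) - min(result.values()):.1f} points")
```

On this toy alignment the same six identities give values from about 66.7% to about 85.7%, which is why a reported PID is hard to interpret unless the denominator is stated.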