
Revealing and avoiding bias in semantic similarity scores for protein pairs

BACKGROUND: Semantic similarity scores for protein pairs are widely applied in functional genomics research for finding functional clusters of proteins, predicting protein functions and protein-protein interactions, and for identifying putative disease genes. However, because some proteins, such as...


Bibliographic Details
Main authors: Wang, Jing; Zhou, Xianxiao; Zhu, Jing; Zhou, Chenggui; Guo, Zheng
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2903568/
https://www.ncbi.nlm.nih.gov/pubmed/20509916
http://dx.doi.org/10.1186/1471-2105-11-290
author Wang, Jing
Zhou, Xianxiao
Zhu, Jing
Zhou, Chenggui
Guo, Zheng
collection PubMed
description BACKGROUND: Semantic similarity scores for protein pairs are widely applied in functional genomics research for finding functional clusters of proteins, predicting protein functions and protein-protein interactions, and for identifying putative disease genes. However, because some proteins, such as those related to diseases, tend to be studied more intensively, annotations are likely to be biased, which may affect applications based on semantic similarity measures. Thus, it is necessary to evaluate the effects of the bias on semantic similarity scores between proteins and then find a method to avoid them. RESULTS: First, we evaluated 14 commonly used semantic similarity scores for protein pairs and demonstrated that they significantly correlated with the numbers of annotation terms for the proteins (also known as the protein annotation length). These results suggested that current applications of the semantic similarity scores between proteins might be unreliable. Then, to reduce this annotation bias effect, we proposed normalizing the semantic similarity scores between proteins using the power transformation of the scores. We provide evidence that this improves performance in some applications. CONCLUSIONS: Current semantic similarity measures for protein pairs are highly dependent on protein annotation lengths, which are subject to biological research bias. This affects applications that are based on these semantic similarity scores, especially in clustering studies that rely on score magnitudes. The normalized scores proposed in this paper can reduce the effects of this bias to some extent.
format Text
id pubmed-2903568
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2903568 2010-07-14 Revealing and avoiding bias in semantic similarity scores for protein pairs Wang, Jing Zhou, Xianxiao Zhu, Jing Zhou, Chenggui Guo, Zheng BMC Bioinformatics Research article
BioMed Central 2010-05-28 /pmc/articles/PMC2903568/ /pubmed/20509916 http://dx.doi.org/10.1186/1471-2105-11-290 Text en Copyright ©2010 Wang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Revealing and avoiding bias in semantic similarity scores for protein pairs
topic Research article