
On triangle inequalities of correlation-based distances for gene expression profiles

BACKGROUND: Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function would output a low value if the profiles are strongly correlated—either negatively or positively—and vice versa. One popular distance function is the absolute correlation d...


Bibliographic Details
Main authors: Chen, Jiaxing, Ng, Yen Kaow, Lin, Lu, Zhang, Xianglilan, Li, Shuaicheng
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9906874/
https://www.ncbi.nlm.nih.gov/pubmed/36755234
http://dx.doi.org/10.1186/s12859-023-05161-y
_version_ 1784884058153549824
author Chen, Jiaxing
Ng, Yen Kaow
Lin, Lu
Zhang, Xianglilan
Li, Shuaicheng
collection PubMed
description BACKGROUND: Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function would output a low value if the profiles are strongly correlated, either negatively or positively, and vice versa. One popular distance function is the absolute correlation distance, [Formula: see text], where [Formula: see text] is a similarity measure, such as Pearson or Spearman correlation. However, the absolute correlation distance fails to fulfill the triangle inequality, which would guarantee better performance in vector quantization, allow fast data localization, and accelerate data clustering. RESULTS: In this work, we propose [Formula: see text] as an alternative. We prove that [Formula: see text] satisfies the triangle inequality when [Formula: see text] represents Pearson correlation, Spearman correlation, or cosine similarity. We show [Formula: see text] to be better than [Formula: see text], another variant of [Formula: see text] that satisfies the triangle inequality, both analytically and experimentally. We empirically compared [Formula: see text] with [Formula: see text] in gene clustering and sample clustering experiments on real-world biological data. The two distances performed similarly in both gene clustering and sample clustering, under both hierarchical clustering and PAM (partitioning around medoids) clustering. However, [Formula: see text] demonstrated more robust clustering. According to the bootstrap experiment, [Formula: see text] generated more robust sample pair partitions more frequently (P-value [Formula: see text]). The statistics on the time at which a class "dissolved" also support the advantage of [Formula: see text] in robustness. CONCLUSION: [Formula: see text], as a variant of absolute correlation distance, satisfies the triangle inequality and is capable of more robust clustering.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05161-y.
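The triangle-inequality claim in the abstract can be illustrated with a small numerical check. This is only a sketch under stated assumptions: the record elides every formula, so the code assumes the usual definition of absolute correlation distance, 1 - |rho|, and uses a square-root variant, sqrt(1 - |rho|), as a hypothetical stand-in for the proposed alternative. The function names and the three example profiles are illustrative and do not come from the article.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation between two 1-D profiles."""
    return float(np.corrcoef(a, b)[0, 1])


def d_abs(a, b):
    # ASSUMPTION: absolute correlation distance taken as 1 - |rho|.
    return 1.0 - abs(pearson(a, b))


def d_sqrt(a, b):
    # ASSUMPTION: square-root variant sqrt(1 - |rho|); the record elides
    # the actual formula, so this is a stand-in, not the authors' definition.
    return float(np.sqrt(max(0.0, 1.0 - abs(pearson(a, b)))))


def satisfies_triangle(dist, x, y, z, tol=1e-12):
    """Check dist(x, z) <= dist(x, y) + dist(y, z) for one triple."""
    return dist(x, z) <= dist(x, y) + dist(y, z) + tol


# Three zero-mean toy profiles: x and z are uncorrelated,
# and y = x + z sits halfway between them (rho of about 0.707 with each).
x = np.array([1.0, -1.0, 1.0, -1.0])
z = np.array([1.0, 1.0, -1.0, -1.0])
y = x + z

abs_ok = satisfies_triangle(d_abs, x, y, z)
sqrt_ok = satisfies_triangle(d_sqrt, x, y, z)
print("1 - |rho| satisfies the triangle inequality here:", abs_ok)
print("sqrt(1 - |rho|) satisfies the triangle inequality here:", sqrt_ok)
```

For this triple, the direct distance between x and z under 1 - |rho| is 1, while the detour through y costs roughly 0.59, so the inequality is violated; under the square-root variant the detour costs roughly 1.08 and the inequality holds. A single triple shows a violation but proves nothing in general; the general statement is what the article itself proves.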
format Online
Article
Text
id pubmed-9906874
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9906874 2023-02-08 On triangle inequalities of correlation-based distances for gene expression profiles Chen, Jiaxing Ng, Yen Kaow Lin, Lu Zhang, Xianglilan Li, Shuaicheng BMC Bioinformatics Methodology BioMed Central 2023-02-08 /pmc/articles/PMC9906874/ /pubmed/36755234 http://dx.doi.org/10.1186/s12859-023-05161-y Text en © The Author(s) 2023
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title On triangle inequalities of correlation-based distances for gene expression profiles
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9906874/
https://www.ncbi.nlm.nih.gov/pubmed/36755234
http://dx.doi.org/10.1186/s12859-023-05161-y