Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans)
BACKGROUND: Data transformations are commonly used in bioinformatics data processing in the context of data projection and clustering. The most used Euclidean metric is not scale invariant and therefore occasionally inappropriate for complex, e.g., multimodal distributed variables and may negatively...
Main authors: | Ultsch, Alfred; Lötsch, Jörn |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9202178/ https://www.ncbi.nlm.nih.gov/pubmed/35710346 http://dx.doi.org/10.1186/s12859-022-04769-w |
_version_ | 1784728476613345280 |
---|---|
author | Ultsch, Alfred; Lötsch, Jörn |
author_facet | Ultsch, Alfred; Lötsch, Jörn |
author_sort | Ultsch, Alfred |
collection | PubMed |
description | BACKGROUND: Data transformations are commonly used in bioinformatics data processing in the context of data projection and clustering. The most commonly used metric, the Euclidean distance, is not scale invariant; it is therefore occasionally inappropriate for complex, e.g., multimodally distributed, variables and may negatively affect the results of cluster analysis. Specifically, the squaring function in the definition of the Euclidean distance as the square root of the sum of squared differences between data points has the consequence that the value 1 implicitly defines a limit separating distances within clusters from distances between clusters. METHODS: The Euclidean distances within a standard normal distribution (N(0,1)) follow a N(0, [Formula: see text]) distribution. The EDO transformation of a variable X is proposed as [Formula: see text] following modeling of the standard deviation s by a mixture of Gaussians and selecting the dominant modes via item categorization. The method was compared in artificial and biomedical datasets with clustering of untransformed data, z-transformed data, and the recently proposed pooled variable scaling. RESULTS: A simulation study and applications to known real data examples showed that the proposed EDO scaling method is generally useful. The clustering results in terms of cluster accuracy, adjusted Rand index and Dunn’s index outperformed the classical alternatives. Finally, the EDO transformation was applied to cluster a high-dimensional genomic dataset consisting of gene expression data for multiple samples of breast cancer tissues; the proposed approach gave better results than classical methods and was compared with pooled variable scaling. CONCLUSIONS: For multivariate procedures of data analysis, it is proposed to use the EDO transformation as a better alternative to the established z-standardization, especially for nontrivially distributed data. The “EDOtrans” R package is available at https://cran.r-project.org/package=EDOtrans. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04769-w. |
format | Online Article Text |
id | pubmed-9202178 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-9202178 2022-06-17 Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans) Ultsch, Alfred; Lötsch, Jörn BMC Bioinformatics Research BioMed Central 2022-06-16 /pmc/articles/PMC9202178/ /pubmed/35710346 http://dx.doi.org/10.1186/s12859-022-04769-w Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Ultsch, Alfred Lötsch, Jörn Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans) |
title | Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans) |
title_sort | euclidean distance-optimized data transformation for cluster analysis in biomedical data (edotrans) |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9202178/ https://www.ncbi.nlm.nih.gov/pubmed/35710346 http://dx.doi.org/10.1186/s12859-022-04769-w |
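
The METHODS statement in the description can be illustrated with a small numerical sketch. This is not the EDOtrans package and not the authors' procedure: the record shows the transformation only as "[Formula: see text]", so the scaling by sqrt(2) * s used here is an assumption, and the plain sample standard deviation of a single simulated Gaussian group stands in for the mixture-model estimate of s. The function names `pairwise_differences` and `scale_by_sqrt2_sd` are hypothetical.

```python
import math
import random
import statistics


def pairwise_differences(values):
    """Return all signed differences a - b between distinct positions."""
    return [a - b
            for i, a in enumerate(values)
            for j, b in enumerate(values)
            if i != j]


def scale_by_sqrt2_sd(values, s):
    """Divide each value by sqrt(2) * s (assumed form of the scaling)."""
    factor = math.sqrt(2.0) * s
    return [v / factor for v in values]


# One simulated N(0, 1) group; the seed is fixed only for reproducibility.
rng = random.Random(0)
group = [rng.gauss(0.0, 1.0) for _ in range(200)]

# Stand-in for s: the sample SD of the group (the paper models s with a
# Gaussian mixture, which this sketch does not attempt).
s = statistics.stdev(group)

# Spread of within-group differences before and after scaling.
sd_raw = statistics.pstdev(pairwise_differences(group))
scaled = scale_by_sqrt2_sd(group, s)
sd_scaled = statistics.pstdev(pairwise_differences(scaled))

print(f"SD of raw differences / s: {sd_raw / s:.4f}")
print(f"SD of scaled differences:  {sd_scaled:.4f}")
```

The first ratio comes out at about 1.4142 (the square root of 2) and the second at 1.0000: under these assumptions, dividing by sqrt(2) * s gives the within-group differences unit spread, which is the role the description assigns to the value 1.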