
Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets

BACKGROUND: Partitioning around medoids (PAM) is one of the most widely used and successful clustering methods in many fields. One of its key advantages is that it only requires a distance or a dissimilarity between the individuals, and the fact that cluster centers are actual points in the data set...
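To make the two properties mentioned above concrete (only a dissimilarity matrix is needed, and every cluster center is an actual data point), here is a minimal Python sketch of the classical PAM scheme (greedy BUILD followed by best-improvement SWAP). It is an illustration only: it is not FASTPAM1, it is not parallel, and it is not the code of the scellpam package. The names `pam` and `total_cost` and the toy data are hypothetical.

```python
def total_cost(dist, medoids):
    """Sum over all points of the distance to their closest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k):
    """Naive PAM on a precomputed square dissimilarity matrix (list of lists)."""
    n = len(dist)
    # BUILD: start with the point that minimises the sum of distances,
    # then greedily add the point that lowers the total cost the most.
    medoids = [min(range(n), key=lambda c: sum(dist[i][c] for i in range(n)))]
    while len(medoids) < k:
        candidates = [c for c in range(n) if c not in medoids]
        medoids.append(min(candidates, key=lambda c: total_cost(dist, medoids + [c])))
    cost = total_cost(dist, medoids)
    # SWAP: repeatedly apply the single medoid/non-medoid exchange that
    # reduces the cost the most; stop when no exchange improves it.
    while True:
        best_trial, best_cost = None, cost
        for pos in range(k):
            for c in range(n):
                if c in medoids:
                    continue
                trial = medoids[:pos] + [c] + medoids[pos + 1:]
                trial_cost = total_cost(dist, trial)
                if trial_cost < best_cost:
                    best_trial, best_cost = trial, trial_cost
        if best_trial is None:
            break
        medoids, cost = best_trial, best_cost
    # Each point is labelled with the position of its closest medoid.
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels, cost


# Toy data (hypothetical): six points on a line, absolute difference as distance.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels, cost = pam(dist, 2)
print(medoids, labels, cost)
```

Note that the function never looks at the coordinates, only at `dist`, and that the returned medoids are indices of rows of the input, that is, actual individuals.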


Bibliographic Details
Main Authors: Domingo, Juan; Leon, Teresa; Dura, Esther
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10503022/
https://www.ncbi.nlm.nih.gov/pubmed/37710192
http://dx.doi.org/10.1186/s12859-023-05471-1
_version_ 1785106435120562176
author Domingo, Juan
Leon, Teresa
Dura, Esther
author_facet Domingo, Juan
Leon, Teresa
Dura, Esther
author_sort Domingo, Juan
collection PubMed
description BACKGROUND: Partitioning around medoids (PAM) is one of the most widely used and successful clustering methods in many fields. One of its key advantages is that it only requires a distance or a dissimilarity between the individuals, and the fact that cluster centers are actual points in the data set means they can be taken as reliable representatives of their classes. However, its wider application is hampered by the large amount of memory needed to store the distance matrix (quadratic in the number of individuals), by the high computational cost of computing that matrix and, less importantly, by the cost of the clustering algorithm itself. RESULTS: New software has therefore been developed to address these issues. This software, released under the GPL license and usable as either an R package or a C++ library, calculates in parallel the distance matrix for different distances/dissimilarities (L1, L2, Pearson, cosine and weighted Euclidean) and also implements a parallel fast version of PAM (FASTPAM1) using any data type to reduce memory usage. Moreover, the parallel implementation uses all the cores available in modern computers, which greatly reduces the execution time. Besides its general application, the software is especially useful for processing data from single cell experiments. It has been tested on problems including clustering of single cell experiments with up to 289,000 cells with the expression of about 29,000 genes per cell. CONCLUSIONS: Comparisons with other current packages in terms of execution time have been made. The method greatly outperforms the available R packages for distance matrix calculation and also improves on the packages that implement PAM itself.
The software is available as an R package at https://CRAN.R-project.org/package=scellpam and as C++ libraries at https://github.com/JdMDE/jmatlib and https://github.com/JdMDE/ppamlib. The package is useful for single cell RNA-seq studies but it is also applicable in other contexts where clustering of large data sets is required. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05471-1.
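The memory problem described in the abstract can be checked with simple arithmetic. The Python sketch below estimates the size of a dissimilarity matrix for the 289,000 cells mentioned above. The helper name `distance_matrix_bytes` is hypothetical, and the two assumptions it explores (storing only the lower triangle with the diagonal, and using 4-byte instead of 8-byte values) are general space-saving options, not a statement of how the package lays out its data.

```python
def distance_matrix_bytes(n, bytes_per_value, symmetric=True):
    """Bytes needed to hold the dissimilarities between n individuals.

    symmetric=True counts only the lower triangle including the diagonal,
    n * (n + 1) / 2 values; symmetric=False counts the full n * n square.
    """
    entries = n * (n + 1) // 2 if symmetric else n * n
    return entries * bytes_per_value


n_cells = 289_000  # largest data set size quoted in the abstract
for label, size in (("4-byte float", 4), ("8-byte double", 8)):
    full = distance_matrix_bytes(n_cells, size, symmetric=False)
    half = distance_matrix_bytes(n_cells, size, symmetric=True)
    print(f"{label}: full square {full / 2**30:.1f} GiB, "
          f"lower triangle {half / 2**30:.1f} GiB")
```

Because the count grows with the square of the number of individuals, doubling the number of cells roughly quadruples the memory, which is why the choice of stored data type matters at this scale.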
format Online
Article
Text
id pubmed-10503022
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10503022 2023-09-16 Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets Domingo, Juan Leon, Teresa Dura, Esther BMC Bioinformatics Software BACKGROUND: Partitioning around medoids (PAM) is one of the most widely used and successful clustering methods in many fields. One of its key advantages is that it only requires a distance or a dissimilarity between the individuals, and the fact that cluster centers are actual points in the data set means they can be taken as reliable representatives of their classes. However, its wider application is hampered by the large amount of memory needed to store the distance matrix (quadratic in the number of individuals), by the high computational cost of computing that matrix and, less importantly, by the cost of the clustering algorithm itself. RESULTS: New software has therefore been developed to address these issues. This software, released under the GPL license and usable as either an R package or a C++ library, calculates in parallel the distance matrix for different distances/dissimilarities (L1, L2, Pearson, cosine and weighted Euclidean) and also implements a parallel fast version of PAM (FASTPAM1) using any data type to reduce memory usage. Moreover, the parallel implementation uses all the cores available in modern computers, which greatly reduces the execution time. Besides its general application, the software is especially useful for processing data from single cell experiments. It has been tested on problems including clustering of single cell experiments with up to 289,000 cells with the expression of about 29,000 genes per cell. CONCLUSIONS: Comparisons with other current packages in terms of execution time have been made. The method greatly outperforms the available R packages for distance matrix calculation and also improves on the packages that implement PAM itself.
The software is available as an R package at https://CRAN.R-project.org/package=scellpam and as C++ libraries at https://github.com/JdMDE/jmatlib and https://github.com/JdMDE/ppamlib. The package is useful for single cell RNA-seq studies but it is also applicable in other contexts where clustering of large data sets is required. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05471-1. BioMed Central 2023-09-14 /pmc/articles/PMC10503022/ /pubmed/37710192 http://dx.doi.org/10.1186/s12859-023-05471-1 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Software
Domingo, Juan
Leon, Teresa
Dura, Esther
Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets
title Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets
title_full Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets
title_fullStr Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets
title_full_unstemmed Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets
title_short Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets
title_sort scellpam: an r package/c++ library to perform parallel partitioning around medoids on scrnaseq data sets
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10503022/
https://www.ncbi.nlm.nih.gov/pubmed/37710192
http://dx.doi.org/10.1186/s12859-023-05471-1
work_keys_str_mv AT domingojuan scellpamanrpackageclibrarytoperformparallelpartitioningaroundmedoidsonscrnaseqdatasets
AT leonteresa scellpamanrpackageclibrarytoperformparallelpartitioningaroundmedoidsonscrnaseqdatasets
AT duraesther scellpamanrpackageclibrarytoperformparallelpartitioningaroundmedoidsonscrnaseqdatasets