Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets
BACKGROUND: Partitioning around medoids (PAM) is one of the most widely used and successful clustering methods in many fields. One of its key advantages is that it only requires a distance or a dissimilarity between the individuals, and the fact that cluster centers are actual points in the data set...
Main authors: | Domingo, Juan; Leon, Teresa; Dura, Esther |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2023 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10503022/ https://www.ncbi.nlm.nih.gov/pubmed/37710192 http://dx.doi.org/10.1186/s12859-023-05471-1 |
_version_ | 1785106435120562176 |
---|---|
author | Domingo, Juan; Leon, Teresa; Dura, Esther |
author_facet | Domingo, Juan; Leon, Teresa; Dura, Esther |
author_sort | Domingo, Juan |
collection | PubMed |
description | BACKGROUND: Partitioning around medoids (PAM) is one of the most widely used and successful clustering methods in many fields. One of its key advantages is that it only requires a distance or a dissimilarity between the individuals, and the fact that cluster centers are actual points in the data set means they can be taken as reliable representatives of their classes. However, its wider application is hampered by the large amount of memory needed to store the distance matrix (quadratic in the number of individuals), by the high computational cost of computing that matrix and, less importantly, by the cost of the clustering algorithm itself. RESULTS: New software is therefore presented that addresses these issues. This software, released under the GPL license and usable as either an R package or a C++ library, calculates in parallel the distance matrix for different distances/dissimilarities ([Formula: see text], [Formula: see text], Pearson, cosine and weighted Euclidean) and also implements a parallel fast version of PAM (FASTPAM1), using any data type to reduce memory usage. Moreover, the parallel implementation uses all the cores available in modern computers, which greatly reduces the execution time. Besides its general application, the software is especially useful for processing data from single cell experiments. It has been tested on problems including the clustering of single cell experiments with up to 289,000 cells and the expression of about 29,000 genes per cell. CONCLUSIONS: Comparisons with other current packages in terms of execution time have been made. The method greatly outperforms the available R packages for distance matrix calculation and also improves on the packages that implement PAM itself. The software is available as an R package at https://CRAN.R-project.org/package=scellpam and as C++ libraries at https://github.com/JdMDE/jmatlib and https://github.com/JdMDE/ppamlib. The package is useful for single cell RNA-seq studies but is also applicable in other contexts where clustering of large data sets is required. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05471-1. |
format | Online Article Text |
id | pubmed-10503022 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-10503022 2023-09-16 Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets Domingo, Juan Leon, Teresa Dura, Esther BMC Bioinformatics Software BACKGROUND: Partitioning around medoids (PAM) is one of the most widely used and successful clustering methods in many fields. One of its key advantages is that it only requires a distance or a dissimilarity between the individuals, and the fact that cluster centers are actual points in the data set means they can be taken as reliable representatives of their classes. However, its wider application is hampered by the large amount of memory needed to store the distance matrix (quadratic in the number of individuals), by the high computational cost of computing that matrix and, less importantly, by the cost of the clustering algorithm itself. RESULTS: New software is therefore presented that addresses these issues. This software, released under the GPL license and usable as either an R package or a C++ library, calculates in parallel the distance matrix for different distances/dissimilarities ([Formula: see text], [Formula: see text], Pearson, cosine and weighted Euclidean) and also implements a parallel fast version of PAM (FASTPAM1), using any data type to reduce memory usage. Moreover, the parallel implementation uses all the cores available in modern computers, which greatly reduces the execution time. Besides its general application, the software is especially useful for processing data from single cell experiments. It has been tested on problems including the clustering of single cell experiments with up to 289,000 cells and the expression of about 29,000 genes per cell. CONCLUSIONS: Comparisons with other current packages in terms of execution time have been made. The method greatly outperforms the available R packages for distance matrix calculation and also improves on the packages that implement PAM itself. The software is available as an R package at https://CRAN.R-project.org/package=scellpam and as C++ libraries at https://github.com/JdMDE/jmatlib and https://github.com/JdMDE/ppamlib. The package is useful for single cell RNA-seq studies but is also applicable in other contexts where clustering of large data sets is required. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05471-1. BioMed Central 2023-09-14 /pmc/articles/PMC10503022/ /pubmed/37710192 http://dx.doi.org/10.1186/s12859-023-05471-1 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Software Domingo, Juan Leon, Teresa Dura, Esther Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets |
title | Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets |
title_full | Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets |
title_fullStr | Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets |
title_full_unstemmed | Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets |
title_short | Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets |
title_sort | scellpam: an r package/c++ library to perform parallel partitioning around medoids on scrnaseq data sets |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10503022/ https://www.ncbi.nlm.nih.gov/pubmed/37710192 http://dx.doi.org/10.1186/s12859-023-05471-1 |
work_keys_str_mv | AT domingojuan scellpamanrpackageclibrarytoperformparallelpartitioningaroundmedoidsonscrnaseqdatasets AT leonteresa scellpamanrpackageclibrarytoperformparallelpartitioningaroundmedoidsonscrnaseqdatasets AT duraesther scellpamanrpackageclibrarytoperformparallelpartitioningaroundmedoidsonscrnaseqdatasets |
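
The abstract in this record names two costs: the memory for a distance matrix that grows quadratically with the number of individuals, and the PAM clustering step itself. The following is a minimal Python sketch of both ideas, not the scellpam implementation and not FASTPAM1. All function names are hypothetical, the lower-triangular storage layout (diagonal included) is an assumption made for illustration, and the clustering routine is a plain BUILD/SWAP PAM on a precomputed matrix.

```python
def distance_matrix_bytes(n, itemsize=4, lower_triangular=True):
    """Bytes needed for an n x n dissimilarity matrix.

    Assumption: a symmetric matrix may be stored as its lower triangle
    including the diagonal, i.e. n * (n + 1) / 2 entries.
    """
    entries = n * (n + 1) // 2 if lower_triangular else n * n
    return entries * itemsize


def l1_matrix(points):
    """Full symmetric L1 (Manhattan) distance matrix as nested lists."""
    return [[sum(abs(a - b) for a, b in zip(p, q)) for q in points]
            for p in points]


def total_cost(dist, medoids):
    """Sum over all points of the distance to the closest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)


def naive_pam(dist, k):
    """Plain PAM on a precomputed matrix: greedy BUILD, then exhaustive SWAP."""
    n = len(dist)
    medoids = []
    # BUILD: k times, add the point that lowers the total cost the most.
    for _ in range(k):
        best = min((i for i in range(n) if i not in medoids),
                   key=lambda i: total_cost(dist, medoids + [i]))
        medoids.append(best)
    # SWAP: apply the best medoid/non-medoid exchange until none improves.
    cost = total_cost(dist, medoids)
    while True:
        best_cost, best_trial = cost, None
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(dist, trial)
                if trial_cost < best_cost:
                    best_cost, best_trial = trial_cost, trial
        if best_trial is None:
            break
        medoids, cost = best_trial, best_cost
    medoids = sorted(medoids)
    # Label each point with the position of its closest medoid.
    labels = [min(range(k), key=lambda j: row[medoids[j]]) for row in dist]
    return medoids, labels, cost


if __name__ == "__main__":
    # Memory for the 289,000-cell case mentioned in the abstract.
    for name, size, tri in [("float64, full", 8, False),
                            ("float32, lower triangle", 4, True)]:
        gb = distance_matrix_bytes(289_000, size, tri) / 1e9
        print(f"{name}: {gb:.0f} GB")

    # Two well-separated toy groups of three points each.
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(naive_pam(l1_matrix(pts), k=2))
```

Each SWAP pass in this sketch re-evaluates the total cost for every medoid and candidate pair, which is why the clustering step becomes expensive on large data sets and why faster, parallel variants such as the one described in the abstract are of interest.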