Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering

The rapid growth of single-cell sequencing techniques enables researchers to investigate up to millions of cells with diverse properties in a single experiment. It also makes it challenging to select representative samples from massive single-cell populations for further experimenta...

Bibliographic Details
Main Authors: Li, Lei, Lan, Linda Yu-Ling, Huang, Lei, Ye, Congting, Andrade, Jorge, Wilson, Patrick C.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9335369/
https://www.ncbi.nlm.nih.gov/pubmed/35910222
http://dx.doi.org/10.3389/fgene.2022.954024
_version_ 1784759324107603968
author Li, Lei
Lan, Linda Yu-Ling
Huang, Lei
Ye, Congting
Andrade, Jorge
Wilson, Patrick C.
author_facet Li, Lei
Lan, Linda Yu-Ling
Huang, Lei
Ye, Congting
Andrade, Jorge
Wilson, Patrick C.
author_sort Li, Lei
collection PubMed
description The rapid growth of single-cell sequencing techniques enables researchers to investigate up to millions of cells with diverse properties in a single experiment. It also makes it challenging to select representative samples from massive single-cell populations for further experimental characterization, which requires robust and compact sampling that balances diverse properties of different priority levels. Conventional sampling methods fail to generate representative and generalizable subsets from a massive single-cell population or more complicated ensembles. Here, we present a toolkit called Cookie, which efficiently selects the most representative samples from a massive single-cell population with diverse properties. The method vectorizes all given properties and quantifies the relationships/similarities among samples using their Manhattan distances. It then determines an appropriate sample size by evaluating the coverage of key properties across multiple candidate sizes, followed by k-medoids clustering, which groups the samples into clusters and selects the center of each cluster as its most representative sample. Comparison of Cookie with conventional sampling methods using a single-cell atlas dataset, epidemiology surveillance data, and a simulated dataset shows the high efficacy, efficiency, and flexibility of Cookie. The Cookie toolkit is implemented in R and is freely available at https://wilsonimmunologylab.github.io/Cookie/.
format Online
Article
Text
id pubmed-9335369
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9335369 2022-07-30 Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering Li, Lei Lan, Linda Yu-Ling Huang, Lei Ye, Congting Andrade, Jorge Wilson, Patrick C. Front Genet Genetics The rapid growth of single-cell sequencing techniques enables researchers to investigate up to millions of cells with diverse properties in a single experiment. It also makes it challenging to select representative samples from massive single-cell populations for further experimental characterization, which requires robust and compact sampling that balances diverse properties of different priority levels. Conventional sampling methods fail to generate representative and generalizable subsets from a massive single-cell population or more complicated ensembles. Here, we present a toolkit called Cookie, which efficiently selects the most representative samples from a massive single-cell population with diverse properties. The method vectorizes all given properties and quantifies the relationships/similarities among samples using their Manhattan distances. It then determines an appropriate sample size by evaluating the coverage of key properties across multiple candidate sizes, followed by k-medoids clustering, which groups the samples into clusters and selects the center of each cluster as its most representative sample. Comparison of Cookie with conventional sampling methods using a single-cell atlas dataset, epidemiology surveillance data, and a simulated dataset shows the high efficacy, efficiency, and flexibility of Cookie. The Cookie toolkit is implemented in R and is freely available at https://wilsonimmunologylab.github.io/Cookie/. Frontiers Media S.A. 2022-07-18 /pmc/articles/PMC9335369/ /pubmed/35910222 http://dx.doi.org/10.3389/fgene.2022.954024 Text en Copyright © 2022 Li, Lan, Huang, Ye, Andrade and Wilson.
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Genetics
Li, Lei
Lan, Linda Yu-Ling
Huang, Lei
Ye, Congting
Andrade, Jorge
Wilson, Patrick C.
Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering
title Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering
title_full Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering
title_fullStr Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering
title_full_unstemmed Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering
title_short Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering
title_sort selecting representative samples from complex biological datasets using k-medoids clustering
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9335369/
https://www.ncbi.nlm.nih.gov/pubmed/35910222
http://dx.doi.org/10.3389/fgene.2022.954024
work_keys_str_mv AT lilei selectingrepresentativesamplesfromcomplexbiologicaldatasetsusingkmedoidsclustering
AT lanlindayuling selectingrepresentativesamplesfromcomplexbiologicaldatasetsusingkmedoidsclustering
AT huanglei selectingrepresentativesamplesfromcomplexbiologicaldatasetsusingkmedoidsclustering
AT yecongting selectingrepresentativesamplesfromcomplexbiologicaldatasetsusingkmedoidsclustering
AT andradejorge selectingrepresentativesamplesfromcomplexbiologicaldatasetsusingkmedoidsclustering
AT wilsonpatrickc selectingrepresentativesamplesfromcomplexbiologicaldatasetsusingkmedoidsclustering
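
Illustrative sketch (not part of the catalog record). The description field above outlines a pipeline: vectorize the sample properties, compute Manhattan distances, pick a sample size by checking coverage of a key property, then run k-medoids and keep the cluster centers. The Python sketch below shows one way those steps could fit together. It is not the Cookie API (Cookie is an R toolkit), and every function name here is hypothetical. Three simplifications are assumptions of this sketch rather than details from the record: properties are treated as categorical and one-hot encoded, "coverage" is taken to be the fraction of distinct values of one key property found among the selected samples, and the clustering is a basic alternating k-medoids with a deterministic start rather than any specific published variant.

```python
def one_hot(samples, keys):
    """Vectorize categorical properties: one 0/1 slot per (property, value) pair."""
    slots = sorted({(k, s[k]) for s in samples for k in keys})
    return [[1.0 if s[k] == v else 0.0 for k, v in slots] for s in samples]


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(vectors):
    n = len(vectors)
    return [[manhattan(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def k_medoids(dist, k, max_iter=100):
    """Basic alternating k-medoids on a precomputed distance matrix (assumed variant)."""
    n = len(dist)
    if k > len({tuple(row) for row in dist}):
        raise ValueError("k exceeds the number of distinct samples")
    # Deterministic start: the most central sample, then farthest-first picks.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        rest = (i for i in range(n) if i not in medoids)
        medoids.append(max(rest, key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        # Update step: the member with the smallest total distance becomes the medoid.
        updated = sorted(
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        )
        if updated == sorted(medoids):
            break
        medoids = updated
    return sorted(medoids)


def coverage(samples, chosen, key):
    """Fraction of the distinct values of `key` present among the chosen samples."""
    all_values = {s[key] for s in samples}
    return len({samples[i][key] for i in chosen}) / len(all_values)


def select_representatives(samples, keys, key_property, candidate_sizes, min_coverage=1.0):
    """Return medoid indices for the smallest candidate size that reaches min_coverage."""
    dist = distance_matrix(one_hot(samples, keys))
    chosen = []
    for k in sorted(candidate_sizes):
        chosen = k_medoids(dist, k)
        if coverage(samples, chosen, key_property) >= min_coverage:
            break
    return chosen  # falls back to the largest candidate if none reaches min_coverage


# Toy data (invented for illustration): seven cells from three tissues.
samples = (
    [{"tissue": "blood", "isotype": "IgG"}] * 3
    + [{"tissue": "spleen", "isotype": "IgA"}] * 2
    + [{"tissue": "tonsil", "isotype": "IgM"}] * 2
)
chosen = select_representatives(samples, ["tissue", "isotype"], "tissue", [2, 3])
print(chosen)  # [0, 3, 5]
```

In this toy run, two medoids cover only two of the three tissues, so the sketch moves on to three medoids, one per tissue. Real datasets with numeric properties or priority weights would need a different vectorization step than the one-hot encoding assumed here.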