DUBStepR is a scalable correlation-based feature selection method for accurately clustering single-cell data

Bibliographic Details
Main authors: Ranjan, Bobby; Sun, Wenjie; Park, Jinyu; Mishra, Kunal; Schmidt, Florian; Xie, Ronald; Alipour, Fatemeh; Singhal, Vipul; Joanito, Ignasius; Honardoost, Mohammad Amin; Yong, Jacy Mei Yun; Koh, Ee Tzun; Leong, Khai Pang; Rayan, Nirmala Arul; Lim, Michelle Gek Liang; Prabhakar, Shyam
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK, 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8494900/
https://www.ncbi.nlm.nih.gov/pubmed/34615861
http://dx.doi.org/10.1038/s41467-021-26085-2
author Ranjan, Bobby
Sun, Wenjie
Park, Jinyu
Mishra, Kunal
Schmidt, Florian
Xie, Ronald
Alipour, Fatemeh
Singhal, Vipul
Joanito, Ignasius
Honardoost, Mohammad Amin
Yong, Jacy Mei Yun
Koh, Ee Tzun
Leong, Khai Pang
Rayan, Nirmala Arul
Lim, Michelle Gek Liang
Prabhakar, Shyam
collection PubMed
description Feature selection (marker gene selection) is widely believed to improve clustering accuracy, and is thus a key component of single-cell clustering pipelines. Existing feature selection methods perform inconsistently across datasets, occasionally even resulting in poorer clustering accuracy than without feature selection. Moreover, existing methods ignore information contained in gene-gene correlations. Here, we introduce DUBStepR (Determining the Underlying Basis using Stepwise Regression), a feature selection algorithm that leverages gene-gene correlations with a novel measure of inhomogeneity in feature space, termed the Density Index (DI). Despite selecting a relatively small number of genes, DUBStepR substantially outperformed existing single-cell feature selection methods across diverse clustering benchmarks. Additionally, DUBStepR was the only method to robustly deconvolve T and NK heterogeneity by identifying disease-associated common and rare cell types and subtypes in PBMCs from rheumatoid arthritis patients. DUBStepR is scalable to over a million cells, and can be straightforwardly applied to other data types such as single-cell ATAC-seq. We propose DUBStepR as a general-purpose feature selection solution for accurately clustering single-cell data.
format Online
Article
Text
id pubmed-8494900
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-8494900 2021-10-07 Nat Commun Article Nature Publishing Group UK 2021-10-06 /pmc/articles/PMC8494900/ /pubmed/34615861 http://dx.doi.org/10.1038/s41467-021-26085-2 Text en
© The Author(s) 2021. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/.
title DUBStepR is a scalable correlation-based feature selection method for accurately clustering single-cell data
topic Article