Optimal tuning of weighted kNN- and diffusion-based methods for denoising single cell genomics data

The analysis of single-cell genomics data presents several statistical challenges, and extensive efforts have been made to produce methods for the analysis of this data that impute missing values, address sampling issues and quantify and correct for noise. In spite of such efforts, no consensus on b...

Bibliographic Details
Main authors: Tjärnberg, Andreas, Mahmood, Omar, Jackson, Christopher A., Saldi, Giuseppe-Antonio, Cho, Kyunghyun, Christiaen, Lionel A., Bonneau, Richard A.
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects: Research Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7817019/
https://www.ncbi.nlm.nih.gov/pubmed/33411784
http://dx.doi.org/10.1371/journal.pcbi.1008569
author Tjärnberg, Andreas
Mahmood, Omar
Jackson, Christopher A.
Saldi, Giuseppe-Antonio
Cho, Kyunghyun
Christiaen, Lionel A.
Bonneau, Richard A.
collection PubMed
description The analysis of single-cell genomics data presents several statistical challenges, and extensive efforts have been made to produce methods for the analysis of this data that impute missing values, address sampling issues and quantify and correct for noise. In spite of such efforts, no consensus on best practices has been established and all current approaches vary substantially based on the available data and empirical tests. The k-Nearest Neighbor Graph (kNN-G) is often used to infer the identities of, and relationships between, cells and is the basis of many widely used dimensionality-reduction and projection methods. The kNN-G has also been the basis for imputation methods using, e.g., neighbor averaging and graph diffusion. However, due to the lack of an agreed-upon optimal objective function for choosing hyperparameters, these methods tend to oversmooth data, thereby resulting in a loss of information with regard to cell identity and the specific gene-to-gene patterns underlying regulatory mechanisms. In this paper, we investigate the tuning of kNN- and diffusion-based denoising methods with a novel non-stochastic method for optimally preserving biologically relevant informative variance in single-cell data. The framework, Denoising Expression data with a Weighted Affinity Kernel and Self-Supervision (DEWÄKSS), uses a self-supervised technique to tune its parameters. We demonstrate that denoising with optimal parameters selected by our objective function (i) is robust to preprocessing methods using data from established benchmarks, (ii) disentangles cellular identity and maintains robust clusters over dimension-reduction methods, (iii) maintains variance along several expression dimensions, unlike previous heuristic-based methods that tend to oversmooth data variance, and (iv) rarely involves diffusion but rather uses a fixed weighted kNN graph for denoising. 
Together, these findings provide a new understanding of kNN- and diffusion-based denoising methods. Code and example data for DEWÄKSS are available at https://gitlab.com/Xparx/dewakss/-/tree/Tjarnberg2020branch.
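The tuning idea in the description above can be illustrated with a minimal sketch. It is not the DEWÄKSS implementation: it assumes an unweighted kNN graph built from plain Euclidean distances, uniform neighbor averaging, and mean squared error as the objective, and the function names `knn_weights`, `self_supervised_mse`, and `tune_k` are hypothetical. Each cell is predicted only from its neighbors, never from itself, so the error against the original values can serve as a self-supervised score for choosing k.

```python
import numpy as np


def knn_weights(X, k):
    """Row-normalised kNN averaging matrix with a zero diagonal.

    X is a (cells x genes) array; k must be smaller than the number of cells.
    Uniform weights are an assumption made here for simplicity.
    """
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of cells.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Exclude self-edges so a cell never contributes to its own estimate.
    np.fill_diagonal(d2, np.inf)
    idx = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, idx.ravel()] = 1.0 / k
    return W


def self_supervised_mse(X, k):
    """Mean squared error between neighbor-averaged data and the input."""
    W = knn_weights(X, k)
    return float(np.mean((W @ X - X) ** 2))


def tune_k(X, candidates):
    """Return the candidate k with the lowest self-supervised error."""
    scores = {k: self_supervised_mse(X, k) for k in candidates}
    best = min(scores, key=scores.get)
    return best, scores
```

Calling `tune_k(X, [5, 10, 20])` on a normalized expression matrix returns the selected k together with the score of every candidate; the denoised matrix for that k would then be `knn_weights(X, best) @ X`.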
format Online
Article
Text
id pubmed-7817019
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7817019 2021-01-28 PLoS Comput Biol Research Article Public Library of Science 2021-01-07 /pmc/articles/PMC7817019/ /pubmed/33411784 http://dx.doi.org/10.1371/journal.pcbi.1008569 Text en © 2021 Tjärnberg et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Optimal tuning of weighted kNN- and diffusion-based methods for denoising single cell genomics data
topic Research Article