
Biology-inspired data-driven quality control for scientific discovery in single-cell transcriptomics

BACKGROUND: Quality control (QC) of cells, a critical first step in single-cell RNA sequencing data analysis, has largely relied on arbitrarily fixed data-agnostic thresholds applied to QC metrics such as gene complexity and fraction of reads mapping to mitochondrial genes. The few existing data-dri...

Bibliographic Details
Main authors: Subramanian, Ayshwarya; Alperovich, Mikhail; Yang, Yiming; Li, Bo
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9793662/
https://www.ncbi.nlm.nih.gov/pubmed/36575523
http://dx.doi.org/10.1186/s13059-022-02820-w
author Subramanian, Ayshwarya
Alperovich, Mikhail
Yang, Yiming
Li, Bo
collection PubMed
description BACKGROUND: Quality control (QC) of cells, a critical first step in single-cell RNA sequencing data analysis, has largely relied on arbitrarily fixed data-agnostic thresholds applied to QC metrics such as gene complexity and fraction of reads mapping to mitochondrial genes. The few existing data-driven approaches perform QC at the level of samples or studies without accounting for biological variation. RESULTS: We first demonstrate that QC metrics vary with both tissue and cell types across technologies, study conditions, and species. We then propose data-driven QC (ddqc), an unsupervised adaptive QC framework to perform flexible and data-driven QC at the level of cell types while retaining critical biological insights and improved power for downstream analysis. ddqc applies an adaptive threshold based on the median absolute deviation on four QC metrics (gene and UMI complexity, fraction of reads mapping to mitochondrial and ribosomal genes). ddqc retains over a third more cells when compared to conventional data-agnostic QC filters. Finally, we show that ddqc recovers biologically meaningful trends in gradation of gene complexity among cell types that can help answer questions of biological interest, such as which cell types express the fewest and the most transcripts overall, and ribosomal transcripts specifically. CONCLUSIONS: ddqc retains cell types such as metabolically active parenchymal cells and specialized cells such as neutrophils, which are often lost by conventional QC. Taken together, our work proposes a revised paradigm for quality-filtering best practices, iterative QC, providing a data-driven QC framework compatible with observed biological diversity. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13059-022-02820-w.
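The adaptive threshold described in the abstract can be illustrated with a minimal sketch. This is not the ddqc package or its API; it only shows the general idea of a median absolute deviation (MAD) cutoff computed within each cluster rather than across the whole sample. The function names, the toy values, the cluster labels, and the multiplier `n_mads=2.0` are all assumptions made for illustration, and the published method may use different defaults or additional steps.

```python
from statistics import median


def mad_outliers(values, n_mads=2.0, side="upper"):
    """Flag values beyond median +/- n_mads * MAD (hypothetical helper).

    side="upper" flags values above the cutoff (e.g. percent mitochondrial);
    side="lower" flags values below it (e.g. number of genes detected).
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if side == "lower":
        cutoff = med - n_mads * mad
        return [v < cutoff for v in values]
    cutoff = med + n_mads * mad
    return [v > cutoff for v in values]


def keep_by_cluster(metric, clusters, n_mads=2.0, side="upper"):
    """Return one boolean per cell: True if the cell passes its own
    cluster's MAD threshold (assumed n_mads, not a published default)."""
    keep = [True] * len(metric)
    for label in set(clusters):
        idx = [i for i, c in enumerate(clusters) if c == label]
        flags = mad_outliers([metric[i] for i in idx], n_mads, side)
        for i, is_outlier in zip(idx, flags):
            keep[i] = not is_outlier
    return keep


# Toy data (invented): percent of mitochondrial reads per cell.
# Cluster "A" has low baseline values with one outlier; cluster "B"
# stands in for a cell type whose values are uniformly high.
pct_mito = [2, 3, 4, 3, 20, 18, 20, 22, 21, 19]
clusters = ["A"] * 5 + ["B"] * 5

keep = keep_by_cluster(pct_mito, clusters, n_mads=2.0, side="upper")
print(keep)
```

In this toy example the cell at 20% in cluster A is flagged, while the cell at 20% in cluster B is kept because it is typical for its cluster. A single fixed cutoff such as 10% would instead discard every cell in cluster B.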
format Online
Article
Text
id pubmed-9793662
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9793662 2022-12-28 Biology-inspired data-driven quality control for scientific discovery in single-cell transcriptomics Subramanian, Ayshwarya Alperovich, Mikhail Yang, Yiming Li, Bo Genome Biol Research BioMed Central 2022-12-27 /pmc/articles/PMC9793662/ /pubmed/36575523 http://dx.doi.org/10.1186/s13059-022-02820-w Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/
Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Biology-inspired data-driven quality control for scientific discovery in single-cell transcriptomics
topic Research