
Influence of single-cell RNA sequencing data integration on the performance of differential gene expression analysis

Large-scale comprehensive single-cell experiments are often resource-intensive and require the involvement of many laboratories and/or measurements taken at various times. This inevitably leads to batch effects, that is, systematic variations in the data that may arise from different technology plat...


Bibliographic Details
Main Authors: Kujawa, Tomasz, Marczyk, Michał, Polanska, Joanna
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9663917/
https://www.ncbi.nlm.nih.gov/pubmed/36386846
http://dx.doi.org/10.3389/fgene.2022.1009316
author Kujawa, Tomasz
Marczyk, Michał
Polanska, Joanna
collection PubMed
description Large-scale comprehensive single-cell experiments are often resource-intensive and require the involvement of many laboratories and/or measurements taken at various times. This inevitably leads to batch effects, that is, systematic variations in the data that may arise from different technology platforms, reagent lots, or handling personnel. Such technical differences confound the biological variation of interest and need to be corrected during data integration. Data integration is challenging because biological and technical factors overlap, which makes it difficult to distinguish their individual contributions to the overall observed effect. Moreover, the choice of integration method may affect downstream analyses, including the search for differentially expressed genes. From the existing data integration methods, we selected only those that return the full expression matrix. We evaluated six methods in terms of their influence on the performance of differential gene expression analysis in two single-cell datasets with the same biological study design that differ only in how the measurements were made: one dataset manifests strong batch effects because each sample was measured at a different time. Integrated data were visualized using the UMAP method. The evaluation was done both at the individual gene level, using parametric and non-parametric approaches for finding differentially expressed genes, and at the gene set level, using gene set enrichment analysis. As evaluation metrics, we used two correlation coefficients, Pearson and Spearman, computed on the test statistics obtained from the reference, test, and corrected studies. Visual comparison of UMAP plots highlighted ComBat-seq, limma, and MNN, which reduced batch effects and preserved differences between biological conditions. Most of the tested methods changed the data distribution after integration, which negatively affects the use of parametric methods for the analysis. Two algorithms, MNN and Scanorama, gave very poor results in differential analysis at the gene and gene set levels. Finally, we highlight ComBat-seq, as it led to the highest correlation of test statistics between the reference and corrected datasets among the tested methods. Moreover, it does not distort the original distribution of gene expression data, so it can be used in all types of downstream analyses.
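The evaluation metric named in the abstract (Pearson and Spearman correlation of per-gene test statistics between studies) can be illustrated with a minimal sketch. This is not the authors' code: the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical, the three lists of test statistics are made-up numbers, and the implementation uses only the Python standard library under the assumption that each list holds one statistic per gene in the same gene order.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-gene test statistics (illustrative only, not from the study).
reference = [2.1, -0.5, 3.3, 0.2, -1.8]    # dataset without strong batch effects
uncorrected = [0.4, 1.9, 0.1, -2.2, 1.0]   # batch-affected dataset before integration
corrected = [1.9, -0.2, 3.0, 0.5, -1.5]    # same dataset after a hypothetical correction

for label, stats in (("uncorrected", uncorrected), ("corrected", corrected)):
    print(label, round(pearson(reference, stats), 3), round(spearman(reference, stats), 3))
```

Under this reading of the metric, an integration method is judged favourably when the correlation between the reference statistics and the corrected statistics is higher than the correlation between the reference and the uncorrected statistics.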
format Online
Article
Text
id pubmed-9663917
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9663917 2022-11-15 Influence of single-cell RNA sequencing data integration on the performance of differential gene expression analysis Kujawa, Tomasz Marczyk, Michał Polanska, Joanna Front Genet Genetics Frontiers Media S.A. 2022-11-01 /pmc/articles/PMC9663917/ /pubmed/36386846 http://dx.doi.org/10.3389/fgene.2022.1009316 Text en Copyright © 2022 Kujawa, Marczyk and Polanska. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Influence of single-cell RNA sequencing data integration on the performance of differential gene expression analysis
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9663917/
https://www.ncbi.nlm.nih.gov/pubmed/36386846
http://dx.doi.org/10.3389/fgene.2022.1009316