Benchmarking atlas-level data integration in single-cell genomics

Single-cell atlases often include samples that span locations, laboratories and conditions, leading to complex, nested batch effects in data. Thus, joint analysis of atlas datasets requires reliable data integration. To guide integration method choice, we benchmarked 68 method and preprocessing comb...

Bibliographic Details
Main Authors: Luecken, Malte D., Büttner, M., Chaichoompu, K., Danese, A., Interlandi, M., Mueller, M. F., Strobl, D. C., Zappia, L., Dugas, M., Colomé-Tatché, M., Theis, Fabian J.
Format: Online Article Text
Language: English
Published: Nature Publishing Group US 2021
Subjects: Analysis
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8748196/
https://www.ncbi.nlm.nih.gov/pubmed/34949812
http://dx.doi.org/10.1038/s41592-021-01336-8
author Luecken, Malte D.
Büttner, M.
Chaichoompu, K.
Danese, A.
Interlandi, M.
Mueller, M. F.
Strobl, D. C.
Zappia, L.
Dugas, M.
Colomé-Tatché, M.
Theis, Fabian J.
collection PubMed
description Single-cell atlases often include samples that span locations, laboratories and conditions, leading to complex, nested batch effects in data. Thus, joint analysis of atlas datasets requires reliable data integration. To guide integration method choice, we benchmarked 68 method and preprocessing combinations on 85 batches of gene expression, chromatin accessibility and simulation data from 23 publications, altogether representing >1.2 million cells distributed in 13 atlas-level integration tasks. We evaluated methods according to scalability, usability and their ability to remove batch effects while retaining biological variation using 14 evaluation metrics. We show that highly variable gene selection improves the performance of data integration methods, whereas scaling pushes methods to prioritize batch removal over conservation of biological variation. Overall, scANVI, Scanorama, scVI and scGen perform well, particularly on complex integration tasks, while single-cell ATAC-sequencing integration performance is strongly affected by choice of feature space. Our freely available Python module and benchmarking pipeline can identify optimal data integration methods for new data, benchmark new methods and improve method development.
format Online
Article
Text
id pubmed-8748196
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Nature Publishing Group US
record_format MEDLINE/PubMed
spelling pubmed-8748196 2022-01-20 Benchmarking atlas-level data integration in single-cell genomics Nat Methods Analysis
Nature Publishing Group US 2021-12-23 2022 /pmc/articles/PMC8748196/ /pubmed/34949812 http://dx.doi.org/10.1038/s41592-021-01336-8 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title Benchmarking atlas-level data integration in single-cell genomics
topic Analysis