Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

Bibliographic Details
Main Authors:	Bej, Saptarshi, Galow, Anne-Marie, David, Robert, Wolfien, Markus, Wolkenhauer, Olaf
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8603509/ https://www.ncbi.nlm.nih.gov/pubmed/34798805 http://dx.doi.org/10.1186/s12859-021-04469-x

collection	PubMed
description	BACKGROUND: The research landscape of single-cell and single-nuclei RNA-sequencing is evolving rapidly. In particular, the detection of rare cells has been greatly facilitated by this technology. However, automated, unbiased, and accurate annotation of rare subpopulations remains challenging. Once rare cells are identified in one dataset, it is usually necessary to generate further specific datasets to enrich the analysis (e.g., with samples from other tissues). From a machine learning perspective, the challenge arises because rare-cell subpopulations constitute an imbalanced classification problem. We introduce a Machine Learning (ML)-based oversampling method that takes gene expression counts of already identified rare cells as input and generates synthetic cells, which are then used to identify similar (rare) cells in other publicly available experiments. We utilize single-cell synthetic oversampling (sc-SynO), which is based on the Localized Random Affine Shadowsampling (LoRAS) algorithm. The algorithm corrects for the overall imbalance ratio between the minority and majority classes.
RESULTS: We demonstrate the effectiveness of our method in three independent use cases, each consisting of already published datasets. The first use case identifies cardiac glial cells in snRNA-Seq data (17 nuclei out of 8635). It was designed to take a larger imbalance ratio (~1 to 500) into account and uses only single-nuclei data. The second use case jointly uses snRNA-Seq and scRNA-Seq data with a lower imbalance ratio (~1 to 26) in the training step, to investigate whether the algorithm can handle both single-cell capture procedures and to assess the impact of “less” rare cell types. The third use case draws on the murine data of the Allen Brain Atlas, which include more than 1 million cells. For validation purposes only, all datasets were also analyzed using common data analysis approaches, such as the Seurat workflow.
CONCLUSIONS: Compared with baseline testing without oversampling, our approach identifies rare cells with a robust precision-recall balance, including high accuracy and a low false positive detection rate. A practical benefit of our algorithm is that it can be readily integrated into other, existing workflows. The code base in R and Python is publicly available at FairdomHub, as well as GitHub, and can easily be transferred to identify other rare-cell types.
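
The abstract describes oversampling a rare minority class by generating synthetic cells from already identified ones. The following is a toy sketch of that general idea, loosely inspired by the description of LoRAS (noisy "shadow" copies of neighbouring minority points, combined with random convex weights). It is not the sc-SynO or LoRAS implementation; the function names, parameters, and default values are hypothetical and chosen only for illustration.

```python
import numpy as np


def imbalance_ratio(n_minority, n_total):
    """Majority-to-minority ratio, e.g. 17 rare nuclei among 8635 (roughly 507)."""
    return (n_total - n_minority) / n_minority


def shadow_oversample(X_min, n_synthetic, k=5, n_shadow=10,
                      sigma=0.05, n_affine=3, seed=0):
    """Generate synthetic minority samples (illustrative sketch only).

    X_min: (n, d) array of minority-class feature vectors.
    Returns an (n_synthetic, d) array of synthetic samples.
    """
    rng = np.random.default_rng(seed)
    n, d = X_min.shape
    k = min(k, n)

    # Pairwise squared distances between minority points.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    # k nearest minority neighbours of each point (the point itself included).
    neighbours = np.argsort(d2, axis=1)[:, :k]

    synthetic = np.empty((n_synthetic, d))
    for i in range(n_synthetic):
        centre = rng.integers(n)
        hood = X_min[neighbours[centre]]
        # "Shadow" samples: noisy copies of the neighbourhood points.
        shadows = np.repeat(hood, n_shadow, axis=0)
        shadows = shadows + rng.normal(0.0, sigma, size=shadows.shape)
        # Random convex combination of a few shadow samples.
        chosen = shadows[rng.choice(len(shadows), size=n_affine, replace=False)]
        weights = rng.dirichlet(np.ones(n_affine))
        synthetic[i] = weights @ chosen
    return synthetic
```

With the counts quoted for the first use case, `imbalance_ratio(17, 8635)` gives about 507, which matches the stated ratio of roughly 1 to 500. In a real workflow the synthetic samples would be added to the minority class before training a classifier; the feature matrix and classifier are left out here because the record does not specify them.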
format	Online Article Text
id	pubmed-8603509
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8603509 2021-11-19 BMC Bioinformatics Research BioMed Central /pmc/articles/PMC8603509/ /pubmed/34798805 http://dx.doi.org/10.1186/s12859-021-04469-x Text en © The Author(s) 2021. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
