
A rank-based marker selection method for high throughput scRNA-seq data

BACKGROUND: High throughput microfluidic protocols in single cell RNA sequencing (scRNA-seq) collect mRNA counts from up to one million individual cells in a single experiment; this enables high resolution studies of rare cell types and cell development pathways. Determining small sets of genetic ma...


Bibliographic Details
Main authors: Vargo, Alexander H. S., Gilbert, Anna C.
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7585212/
https://www.ncbi.nlm.nih.gov/pubmed/33097004
http://dx.doi.org/10.1186/s12859-020-03641-z
_version_ 1783599740335685632
author Vargo, Alexander H. S.
Gilbert, Anna C.
author_facet Vargo, Alexander H. S.
Gilbert, Anna C.
author_sort Vargo, Alexander H. S.
collection PubMed
description BACKGROUND: High throughput microfluidic protocols in single cell RNA sequencing (scRNA-seq) collect mRNA counts from up to one million individual cells in a single experiment; this enables high resolution studies of rare cell types and cell development pathways. Determining small sets of genetic markers that can identify specific cell populations is thus one of the major objectives of computational analysis of mRNA counts data. Many tools have been developed for marker selection on single cell data; most of them, however, are based on complex statistical models and handle the multi-class case in an ad-hoc manner. RESULTS: We introduce RankCorr, a fast method with strong mathematical underpinnings that performs multi-class marker selection in an informed manner. RankCorr proceeds by ranking the mRNA counts data before linearly separating the ranked data using a small number of genes. The step of ranking is intuitively natural for scRNA-seq data and provides a non-parametric method for analyzing count data. In addition, we present several performance measures for evaluating the quality of a set of markers when there is no known ground truth. Using these metrics, we compare the performance of RankCorr to a variety of other marker selection methods on an assortment of experimental and synthetic data sets that range in size from several thousand to one million cells. CONCLUSIONS: According to the metrics introduced in this work, RankCorr is consistently one of the most optimal marker selection methods on scRNA-seq data. Most methods show similar overall performance, however; thus, the speed of the algorithm is the most important consideration for large data sets (and comparing the markers selected by several methods can be fruitful). RankCorr is fast enough to easily handle the largest data sets and, as such, it is a useful tool to add into computational pipelines when dealing with high throughput scRNA-seq data. RankCorr software is available for download at https://github.com/ahsv/RankCorr with extensive documentation.
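The rank-then-select idea in the description can be illustrated with a small sketch. This is not the RankCorr implementation: the abstract says RankCorr linearly separates the ranked data, whereas the sketch below swaps in a simpler score, the correlation between each gene's ranked counts and a +1/-1 indicator of the target cluster. The function names (`average_ranks`, `select_markers`) and the toy count matrix are hypothetical and exist only for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks of values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def select_markers(counts, labels, target, n_markers):
    """Score genes by correlating ranked counts with a target-cluster indicator.

    counts: list of rows (cells), each a list of per-gene counts.
    Returns (indices of the n_markers highest |score| genes, all scores).
    """
    indicator = [1.0 if lab == target else -1.0 for lab in labels]
    ind_mean = mean(indicator)
    var_t = sum((t - ind_mean) ** 2 for t in indicator)
    scores = []
    for g in range(len(counts[0])):
        ranks = average_ranks([row[g] for row in counts])
        r_mean = mean(ranks)
        cov = sum((r - r_mean) * (t - ind_mean) for r, t in zip(ranks, indicator))
        var_r = sum((r - r_mean) ** 2 for r in ranks)
        # A constant gene (or a single-class labelling) carries no signal.
        scores.append(cov / (var_r * var_t) ** 0.5 if var_r > 0 and var_t > 0 else 0.0)
    by_strength = sorted(range(len(scores)), key=lambda g: -abs(scores[g]))
    return by_strength[:n_markers], scores


# Toy data (made up): 6 cells x 3 genes; gene 0 is high only in cluster "A".
counts = [
    [5, 0, 1],
    [7, 4, 0],
    [6, 1, 2],
    [0, 3, 1],
    [0, 0, 0],
    [1, 5, 2],
]
labels = ["A", "A", "A", "B", "B", "B"]
markers, scores = select_markers(counts, labels, target="A", n_markers=1)
print(markers, [round(s, 3) for s in scores])
```

Because only ranks enter the score, rescaling a gene's counts by any increasing transformation (for example a log transform) leaves the result unchanged, which is the non-parametric property the description refers to.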
format Online
Article
Text
id pubmed-7585212
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7585212 2020-10-26 A rank-based marker selection method for high throughput scRNA-seq data Vargo, Alexander H. S. Gilbert, Anna C. BMC Bioinformatics Methodology Article BACKGROUND: High throughput microfluidic protocols in single cell RNA sequencing (scRNA-seq) collect mRNA counts from up to one million individual cells in a single experiment; this enables high resolution studies of rare cell types and cell development pathways. Determining small sets of genetic markers that can identify specific cell populations is thus one of the major objectives of computational analysis of mRNA counts data. Many tools have been developed for marker selection on single cell data; most of them, however, are based on complex statistical models and handle the multi-class case in an ad-hoc manner. RESULTS: We introduce RankCorr, a fast method with strong mathematical underpinnings that performs multi-class marker selection in an informed manner. RankCorr proceeds by ranking the mRNA counts data before linearly separating the ranked data using a small number of genes. The step of ranking is intuitively natural for scRNA-seq data and provides a non-parametric method for analyzing count data. In addition, we present several performance measures for evaluating the quality of a set of markers when there is no known ground truth. Using these metrics, we compare the performance of RankCorr to a variety of other marker selection methods on an assortment of experimental and synthetic data sets that range in size from several thousand to one million cells. CONCLUSIONS: According to the metrics introduced in this work, RankCorr is consistently one of the most optimal marker selection methods on scRNA-seq data. Most methods show similar overall performance, however; thus, the speed of the algorithm is the most important consideration for large data sets (and comparing the markers selected by several methods can be fruitful). RankCorr is fast enough to easily handle the largest data sets and, as such, it is a useful tool to add into computational pipelines when dealing with high throughput scRNA-seq data. RankCorr software is available for download at https://github.com/ahsv/RankCorr with extensive documentation. BioMed Central 2020-10-23 /pmc/articles/PMC7585212/ /pubmed/33097004 http://dx.doi.org/10.1186/s12859-020-03641-z Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Methodology Article
Vargo, Alexander H. S.
Gilbert, Anna C.
A rank-based marker selection method for high throughput scRNA-seq data
title A rank-based marker selection method for high throughput scRNA-seq data
title_full A rank-based marker selection method for high throughput scRNA-seq data
title_fullStr A rank-based marker selection method for high throughput scRNA-seq data
title_full_unstemmed A rank-based marker selection method for high throughput scRNA-seq data
title_short A rank-based marker selection method for high throughput scRNA-seq data
title_sort rank-based marker selection method for high throughput scrna-seq data
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7585212/
https://www.ncbi.nlm.nih.gov/pubmed/33097004
http://dx.doi.org/10.1186/s12859-020-03641-z
work_keys_str_mv AT vargoalexanderhs arankbasedmarkerselectionmethodforhighthroughputscrnaseqdata
AT gilbertannac arankbasedmarkerselectionmethodforhighthroughputscrnaseqdata
AT vargoalexanderhs rankbasedmarkerselectionmethodforhighthroughputscrnaseqdata
AT gilbertannac rankbasedmarkerselectionmethodforhighthroughputscrnaseqdata