A comparison of rule-based and centroid single-sample multiclass predictors for transcriptomic classification
MOTIVATION: Gene expression-based multiclass prediction, such as tumor subtyping, is a non-trivial bioinformatic problem. Most classifier methods operate by comparing expression levels relative to other samples. Methods that base predictions on the expression pattern within a sample have been propos...
Main authors: | Eriksson, Pontus, Marzouka, Nour-al-dain, Sjödahl, Gottfrid, Bernardo, Carina, Liedberg, Fredrik, Höglund, Mattias |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2021 |
Subjects: | Original Papers |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8796360/ https://www.ncbi.nlm.nih.gov/pubmed/34788787 http://dx.doi.org/10.1093/bioinformatics/btab763 |
_version_ | 1784641288330543104 |
---|---|
author | Eriksson, Pontus Marzouka, Nour-al-dain Sjödahl, Gottfrid Bernardo, Carina Liedberg, Fredrik Höglund, Mattias |
author_facet | Eriksson, Pontus Marzouka, Nour-al-dain Sjödahl, Gottfrid Bernardo, Carina Liedberg, Fredrik Höglund, Mattias |
author_sort | Eriksson, Pontus |
collection | PubMed |
description | MOTIVATION: Gene expression-based multiclass prediction, such as tumor subtyping, is a non-trivial bioinformatic problem. Most classifier methods operate by comparing expression levels relative to other samples. Methods that base predictions on the expression pattern within a sample have been proposed as an alternative. As these methods are invariant to the cohort composition and can be applied to a sample in isolation, they can collectively be termed single sample predictors (SSP). Such predictors could potentially be used for preprocessing-free classification of new samples and be built to function across different expression platforms where proper batch and dataset normalization is challenging. Here, we evaluate the behavior of several multiclass SSPs based on binary gene-pair rules (k-Top Scoring Pairs, Absolute Intrinsic Molecular Subtyping and a new Random Forest approach) and compare them to centroids built with centered or raw expression values, with the criteria that an optimal predictor should have high accuracy, overcome differences in tumor purity, be robust across expression platforms and provide an informative prediction output score. RESULTS: We found that gene-pair-based SSPs showed excellent performance on many expression-based classification tasks. The three methods differed in prediction score output, handling of tied scores and behavior in low purity samples. The k-Top Scoring Pairs and Random Forest approach both achieved high classification accuracy while providing an informative prediction score. Although gene-pair-based SSPs have been touted as being cross-platform compatible (through training on mixed platform data), out-of-the-box compatibility with a new dataset remains a potential issue that warrants cohort-to-cohort verification. AVAILABILITY AND IMPLEMENTATION: Our R package ‘multiclassPairs’ (https://cran.r-project.org/package=multiclassPairs) (https://doi.org/10.1093/bioinformatics/btab088) is freely available and enables easy training, prediction, and visualization using the gene-pair rule-based Random Forest SSP method and provides additional multiclass functionalities to the switchBox k-Top-Scoring Pairs package. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-8796360 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-8796360 2022-01-31 A comparison of rule-based and centroid single-sample multiclass predictors for transcriptomic classification Eriksson, Pontus Marzouka, Nour-al-dain Sjödahl, Gottfrid Bernardo, Carina Liedberg, Fredrik Höglund, Mattias Bioinformatics Original Papers Oxford University Press 2021-11-12 /pmc/articles/PMC8796360/ /pubmed/34788787 http://dx.doi.org/10.1093/bioinformatics/btab763 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Papers Eriksson, Pontus Marzouka, Nour-al-dain Sjödahl, Gottfrid Bernardo, Carina Liedberg, Fredrik Höglund, Mattias A comparison of rule-based and centroid single-sample multiclass predictors for transcriptomic classification |
title | A comparison of rule-based and centroid single-sample multiclass predictors for transcriptomic classification |
title_sort | comparison of rule-based and centroid single-sample multiclass predictors for transcriptomic classification |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8796360/ https://www.ncbi.nlm.nih.gov/pubmed/34788787 http://dx.doi.org/10.1093/bioinformatics/btab763 |
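
The abstract describes single sample predictors built from binary gene-pair rules, which depend only on the ordering of expression values within one sample. The Python sketch below illustrates that idea under stated assumptions: the gene names, the rule sets, the class labels and the simple vote-fraction score are all hypothetical choices made for illustration, and the code is not the multiclassPairs or switchBox implementation. It shows that a rule-based prediction is unchanged by a monotone transformation such as log2(x + 1), and that a tied top score is reported rather than resolved.

```python
import math

# Hypothetical rule sets (illustrative only, not taken from the paper):
# each class owns gene pairs (a, b); a rule votes for the class when gene a
# is strictly more highly expressed than gene b within the same sample.
RULES = {
    "luminal": [("GATA3", "KRT5"), ("PPARG", "KRT14"), ("FOXA1", "KRT6A")],
    "basal": [("KRT5", "GATA3"), ("KRT14", "PPARG"), ("KRT6A", "FOXA1")],
}


def class_scores(sample, rules=RULES):
    """Return, per class, the fraction of its gene-pair rules that hold."""
    scores = {}
    for label, pairs in rules.items():
        votes = sum(1 for a, b in pairs if sample[a] > sample[b])
        scores[label] = votes / len(pairs)
    return scores


def predict(sample, rules=RULES):
    """Return (label, scores); label is None when the top score is tied."""
    scores = class_scores(sample, rules)
    best = max(scores.values())
    winners = [label for label, s in scores.items() if s == best]
    return (winners[0] if len(winners) == 1 else None), scores


# One sample on a raw scale (made-up values).
sample = {
    "GATA3": 850.0, "KRT5": 12.0,
    "PPARG": 300.0, "KRT14": 5.0,
    "FOXA1": 40.0, "KRT6A": 95.0,
}
label, scores = predict(sample)

# The same sample after a monotone transformation: the within-sample
# ordering of genes is preserved, so every rule gives the same answer.
log_sample = {gene: math.log2(value + 1.0) for gene, value in sample.items()}
log_label, log_scores = predict(log_sample)

# A sample in which all genes are equal satisfies no strict rule, so both
# classes score 0 and the prediction is reported as a tie (None).
flat_sample = {gene: 1.0 for gene in sample}
flat_label, flat_scores = predict(flat_sample)
```

In this sketch the vote fraction plays the role of a prediction score; how scores are computed and how ties are handled differs between the methods compared in the paper, so these choices should be read as placeholders.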