A comparison of rule-based and centroid single-sample multiclass predictors for transcriptomic classification


Bibliographic Details
Main authors: Eriksson, Pontus, Marzouka, Nour-al-dain, Sjödahl, Gottfrid, Bernardo, Carina, Liedberg, Fredrik, Höglund, Mattias
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects: Original Papers
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8796360/
https://www.ncbi.nlm.nih.gov/pubmed/34788787
http://dx.doi.org/10.1093/bioinformatics/btab763
collection PubMed
description MOTIVATION: Gene expression-based multiclass prediction, such as tumor subtyping, is a non-trivial bioinformatic problem. Most classifier methods operate by comparing expression levels relative to other samples. Methods that base predictions on the expression pattern within a sample have been proposed as an alternative. As these methods are invariant to the cohort composition and can be applied to a sample in isolation, they can collectively be termed single sample predictors (SSPs). Such predictors could potentially be used for preprocessing-free classification of new samples and be built to function across different expression platforms where proper batch and dataset normalization is challenging. Here, we evaluate the behavior of several multiclass SSPs based on binary gene-pair rules (k-Top Scoring Pairs, Absolute Intrinsic Molecular Subtyping and a new Random Forest approach) and compare them to centroids built with centered or raw expression values, with the criteria that an optimal predictor should have high accuracy, overcome differences in tumor purity, be robust across expression platforms and provide an informative prediction output score.
RESULTS: We found that gene-pair-based SSPs showed excellent performance on many expression-based classification tasks. The three methods differed in prediction score output, handling of tied scores and behavior in low purity samples. The k-Top Scoring Pairs and Random Forest approaches both achieved high classification accuracy while providing an informative prediction score. Although gene-pair-based SSPs have been touted as being cross-platform compatible (through training on mixed platform data), out-of-the-box compatibility with a new dataset remains a potential issue that warrants cohort-to-cohort verification.
AVAILABILITY AND IMPLEMENTATION: Our R package ‘multiclassPairs’ (https://cran.r-project.org/package=multiclassPairs) (https://doi.org/10.1093/bioinformatics/btab088) is freely available and enables easy training, prediction, and visualization using the gene-pair rule-based Random Forest SSP method and provides additional multiclass functionalities to the switchBox k-Top-Scoring Pairs package.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
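The idea of a binary gene-pair rule described in the abstract can be illustrated with a minimal Python sketch. This is not the multiclassPairs or switchBox implementation: the gene names, the three rules, the expression values, and the simple majority vote are all assumptions invented for illustration. The sketch only shows why a prediction built from within-sample comparisons does not change when the sample is rescaled by a strictly increasing transformation.

```python
import math

# Hypothetical gene-pair rules, invented for illustration only.
# Each rule reads: if gene_a is expressed higher than gene_b within the
# sample, vote for class_if_greater; otherwise vote for class_otherwise.
RULES = [
    ("GENE_A", "GENE_B", "subtype1", "subtype2"),
    ("GENE_C", "GENE_D", "subtype1", "subtype2"),
    ("GENE_E", "GENE_F", "subtype2", "subtype1"),
]


def pair_rule_scores(sample, rules):
    """Return, for each class, the fraction of rules that vote for it."""
    votes = {}
    for gene_a, gene_b, class_if_greater, class_otherwise in rules:
        votes.setdefault(class_if_greater, 0)
        votes.setdefault(class_otherwise, 0)
        if sample[gene_a] > sample[gene_b]:
            votes[class_if_greater] += 1
        else:
            votes[class_otherwise] += 1
    return {cls: count / len(rules) for cls, count in votes.items()}


def predict(sample, rules):
    """Pick the class with the most votes.

    Ties fall to the class seen first; this is an arbitrary choice made
    for the sketch, not a recommendation.
    """
    scores = pair_rule_scores(sample, rules)
    return max(scores, key=scores.get)


# Made-up expression values for a single sample on a raw (linear) scale.
sample = {
    "GENE_A": 120.0, "GENE_B": 35.0,
    "GENE_C": 8.0, "GENE_D": 15.0,
    "GENE_E": 2.0, "GENE_F": 40.0,
}

# The same sample after a strictly increasing transformation
# (log2 with a pseudocount), standing in for a different preprocessing.
log_sample = {gene: math.log2(value + 1.0) for gene, value in sample.items()}

print(predict(sample, RULES), pair_rule_scores(sample, RULES))
print(predict(log_sample, RULES), pair_rule_scores(log_sample, RULES))
```

Both calls return the same class and the same vote fractions, because the transformation preserves the ordering of genes within the sample. The sketch says nothing about differences between platforms that change the relative ordering of two genes, which is the situation the abstract flags as needing cohort-to-cohort verification.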
format Online
Article
Text
id pubmed-8796360
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8796360 2022-01-31 Bioinformatics Original Papers Oxford University Press 2021-11-12 /pmc/articles/PMC8796360/ /pubmed/34788787 http://dx.doi.org/10.1093/bioinformatics/btab763 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Original Papers