Factors affecting the accuracy of a class prediction model in gene expression data

BACKGROUND: Class prediction models have been shown to have varying performances in clinical gene expression datasets. Previous evaluation studies, mostly done in the field of cancer, showed that the accuracy of class prediction models differs from dataset to dataset and depends on the type of class...

Bibliographic Details
Main authors: Novianti, Putri W., Jong, Victor L., Roes, Kit C. B., Eijkemans, Marinus J. C.
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4475623/
https://www.ncbi.nlm.nih.gov/pubmed/26093633
http://dx.doi.org/10.1186/s12859-015-0610-4
_version_ 1782377487310782464
author Novianti, Putri W.
Jong, Victor L.
Roes, Kit C. B.
Eijkemans, Marinus J. C.
author_facet Novianti, Putri W.
Jong, Victor L.
Roes, Kit C. B.
Eijkemans, Marinus J. C.
author_sort Novianti, Putri W.
collection PubMed
description BACKGROUND: Class prediction models have been shown to vary in performance across clinical gene expression datasets. Previous evaluation studies, mostly done in the field of cancer, showed that the accuracy of class prediction models differs from dataset to dataset and depends on the type of classification function. While a substantial amount is known about the characteristics of classification functions, little has been done to determine which characteristics of gene expression data have an impact on the performance of a classifier. This study aims to empirically identify data characteristics that affect the predictive accuracy of classification models, outside of the field of cancer.
RESULTS: Datasets from twenty-five studies meeting predefined inclusion and exclusion criteria were downloaded. Nine classification functions were chosen, falling within the categories of discriminant analyses or Bayes classifiers, tree-based methods, regularization and shrinkage methods, and nearest-neighbor methods. Consequently, nine class prediction models were built for each dataset using the same procedure, and their performances were evaluated by calculating their accuracies. The characteristics of each experiment were recorded (i.e., observed disease, medical question, tissue/cell types, and sample size), together with characteristics of the gene expression data, namely the number of differentially expressed genes, the fold changes, and the within-class correlations. Their effects on the accuracy of a class prediction model were statistically assessed by random effects logistic regression. The number of differentially expressed genes and the average fold change had a significant impact on the accuracy of a classification model and individually explained up to 72% and 57% of the variation in prediction accuracy, respectively. Multivariable random effects logistic regression with forward selection yielded the two aforementioned study factors and the within-class correlation as factors affecting the accuracy of classification functions, explaining 91.5% of the between-study variation.
CONCLUSIONS: We evaluated study- and data-related factors that might explain the varying performances of classification functions in non-cancerous datasets. Our results showed that the number of differentially expressed genes, the fold change, and the correlation in gene expression data significantly affect the accuracy of class prediction models.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0610-4) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4475623
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4475623 2015-06-22 Factors affecting the accuracy of a class prediction model in gene expression data Novianti, Putri W. Jong, Victor L. Roes, Kit C. B. Eijkemans, Marinus J. C. BMC Bioinformatics Research Article BioMed Central 2015-06-21 /pmc/articles/PMC4475623/ /pubmed/26093633 http://dx.doi.org/10.1186/s12859-015-0610-4 Text en © Novianti et al. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Novianti, Putri W.
Jong, Victor L.
Roes, Kit C. B.
Eijkemans, Marinus J. C.
Factors affecting the accuracy of a class prediction model in gene expression data
title Factors affecting the accuracy of a class prediction model in gene expression data
title_full Factors affecting the accuracy of a class prediction model in gene expression data
title_fullStr Factors affecting the accuracy of a class prediction model in gene expression data
title_full_unstemmed Factors affecting the accuracy of a class prediction model in gene expression data
title_short Factors affecting the accuracy of a class prediction model in gene expression data
title_sort factors affecting the accuracy of a class prediction model in gene expression data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4475623/
https://www.ncbi.nlm.nih.gov/pubmed/26093633
http://dx.doi.org/10.1186/s12859-015-0610-4
work_keys_str_mv AT noviantiputriw factorsaffectingtheaccuracyofaclasspredictionmodelingeneexpressiondata
AT jongvictorl factorsaffectingtheaccuracyofaclasspredictionmodelingeneexpressiondata
AT roeskitcb factorsaffectingtheaccuracyofaclasspredictionmodelingeneexpressiondata
AT eijkemansmarinusjc factorsaffectingtheaccuracyofaclasspredictionmodelingeneexpressiondata
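
The abstract in this record names three data characteristics as drivers of prediction accuracy: the number of differentially expressed genes, the fold change, and the within-class correlation. The following is a minimal sketch of how such quantities could be computed for one two-class dataset. It is not the authors' procedure: the function names, the use of a Welch t statistic with a fixed cutoff (`t_cutoff`), the averaging of pairwise gene-gene correlations, and the toy expression values are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def data_characteristics(expr, labels, t_cutoff=2.0):
    """Summarise a two-class expression dataset.

    expr:   {gene name: [log2 expression value per sample]}
    labels: class label (0 or 1) per sample, in the same sample order
    t_cutoff is a hypothetical threshold, not a value taken from the study.
    """
    idx0 = [i for i, c in enumerate(labels) if c == 0]
    idx1 = [i for i, c in enumerate(labels) if c == 1]

    fold_changes, n_de = [], 0
    for values in expr.values():
        g0 = [values[i] for i in idx0]
        g1 = [values[i] for i in idx1]
        # On the log2 scale, the fold change is a difference of class means.
        fold_changes.append(abs(mean(g1) - mean(g0)))
        if abs(welch_t(g1, g0)) > t_cutoff:
            n_de += 1

    # Mean pairwise gene-gene correlation, computed separately inside each class.
    corrs = []
    for idx in (idx0, idx1):
        for ga, gb in combinations(expr.values(), 2):
            corrs.append(pearson([ga[i] for i in idx], [gb[i] for i in idx]))

    return {
        "n_de_genes": n_de,
        "mean_abs_log2_fc": mean(fold_changes),
        "mean_within_class_corr": mean(corrs),
    }


# Toy data: 3 genes, 4 samples per class (invented for illustration only).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
expr = {
    "geneA": [5.0, 5.2, 4.9, 5.1, 7.0, 7.3, 6.9, 7.1],
    "geneB": [3.0, 3.1, 2.9, 3.2, 3.1, 3.0, 3.2, 2.9],
    "geneC": [6.0, 6.4, 5.8, 6.1, 8.1, 8.5, 7.9, 8.2],
}
result = data_characteristics(expr, labels)
print(result)
```

In a study of the kind the abstract describes, summaries like these would be computed once per dataset and then related to classifier accuracy in a regression model. That modelling step is outside this sketch.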