
Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping

BACKGROUND: Many supervised learning algorithms have been applied in deriving gene signatures for patient stratification from gene expression data. However, transferring the multi-gene signatures from one analytical platform to another without loss of classification accuracy is a major challenge. He...


Bibliographic Details
Main authors:	Jung, Segun, Bi, Yingtao, Davuluri, Ramana V
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4652565/ https://www.ncbi.nlm.nih.gov/pubmed/26576613 http://dx.doi.org/10.1186/1471-2164-16-S11-S3

author	Jung, Segun Bi, Yingtao Davuluri, Ramana V
collection	PubMed
description	BACKGROUND: Many supervised learning algorithms have been applied to derive gene signatures for patient stratification from gene expression data. However, transferring multi-gene signatures from one analytical platform to another without loss of classification accuracy is a major challenge. Here, we compared three unsupervised data discretization methods (Equal-width binning, Equal-frequency binning, and k-means clustering) in accurately classifying the four known subtypes of glioblastoma multiforme (GBM) when the classification algorithms were trained on isoform-level gene expression profiles from the exon-array platform and tested on the corresponding profiles from RNA-seq data. RESULTS: We applied an integrated machine learning framework that involves three sequential steps: feature selection, data discretization, and classification. For models trained and tested on exon-array data, adding a data discretization step led to robust and accurate predictive models with fewer variables in the final models. For models trained on exon-array data and tested on RNA-seq data, adding a data discretization step dramatically improved classification accuracy, with Equal-frequency binning showing the largest improvement: accuracy exceeded 90% for all models with features chosen by Random Forest based feature selection. Overall, the SVM classifier coupled with Equal-frequency binning achieved the best accuracy (> 95%). Without data discretization, however, accuracy reached at most 73.6%. CONCLUSIONS: The classification algorithms, trained and tested on data from the same platform, yielded similar accuracies in predicting the four GBM subgroups. However, when dealing with cross-platform data, from exon-array to RNA-seq, the classifiers yielded stable models with the highest classification accuracies on data transformed by Equal-frequency binning. The approach presented here is generally applicable to other cancer types for classification and identification of molecular subgroups by integrating data across different gene expression platforms.
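The three discretization methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of three bins, the toy values, and the deterministic k-means start are all assumptions made for illustration. The example also shows one property of rank-based (equal-frequency) binning that is a general fact rather than a claim from the abstract: its labels do not change under a strictly increasing rescaling of the data, such as a log transform.

```python
import math


def equal_width_bins(values, n_bins=3):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # min() keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins=3):
    """Assign bins by rank so each bin holds roughly the same count.

    Ties are split by position here, a simplification for illustration.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels


def kmeans_bins(values, n_bins=3, n_iter=100):
    """1-D k-means (Lloyd's algorithm) with evenly spread starting centers."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (k + 0.5) / n_bins for k in range(n_bins)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: nearest center for every value.
        labels = [min(range(n_bins), key=lambda k: abs(v - centers[k]))
                  for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(n_bins):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


# Hypothetical expression values for one feature across six samples,
# on a raw scale and on a log2 scale (a strictly increasing transform).
raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
logged = [math.log2(v) for v in raw]

print(equal_width_bins(raw), equal_width_bins(logged))
print(equal_frequency_bins(raw), equal_frequency_bins(logged))  # both [0, 0, 1, 1, 2, 2]
print(kmeans_bins(raw), kmeans_bins(logged))
```

In this toy case the equal-width labels differ between the two scales, while the equal-frequency labels are identical. Whether this property explains the cross-platform results reported above is not stated in the abstract.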
format	Online Article Text
id	pubmed-4652565
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4652565 2015-11-25 Jung, Segun Bi, Yingtao Davuluri, Ramana V BMC Genomics Research BioMed Central 2015-11-10 /pmc/articles/PMC4652565/ /pubmed/26576613 http://dx.doi.org/10.1186/1471-2164-16-S11-S3 Text en Copyright © 2015 Jung et al. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping
topic	Research
