Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping

Bibliographic Details
Main Authors: Jung, Segun, Bi, Yingtao, Davuluri, Ramana V
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4652565/
https://www.ncbi.nlm.nih.gov/pubmed/26576613
http://dx.doi.org/10.1186/1471-2164-16-S11-S3
author Jung, Segun
Bi, Yingtao
Davuluri, Ramana V
collection PubMed
description BACKGROUND: Many supervised learning algorithms have been applied to derive gene signatures for patient stratification from gene expression data. However, transferring a multi-gene signature from one analytical platform to another without loss of classification accuracy is a major challenge. Here, we compared three unsupervised data discretization methods (Equal-width binning, Equal-frequency binning, and k-means clustering) for accurately classifying the four known subtypes of glioblastoma multiforme (GBM) when the classification algorithms were trained on isoform-level gene expression profiles from the exon-array platform and tested on the corresponding profiles from RNA-seq data.
RESULTS: We applied an integrated machine learning framework that involves three sequential steps: feature selection, data discretization, and classification. For models trained and tested on exon-array data, adding the data discretization step led to robust and accurate predictive models with fewer variables in the final models. For models trained on exon-array data and tested on RNA-seq data, adding the data discretization step dramatically improved classification accuracy. Equal-frequency binning showed the largest improvement, with accuracy above 90% for all models whose features were chosen by Random Forest-based feature selection. Overall, the SVM classifier coupled with Equal-frequency binning achieved the best accuracy (> 95%). Without data discretization, the best accuracy achieved was only 73.6%.
CONCLUSIONS: The classification algorithms, trained and tested on data from the same platform, yielded similar accuracies in predicting the four GBM subgroups. However, when dealing with cross-platform data, from exon-array to RNA-seq, the classifiers yielded stable models with the highest classification accuracies on data transformed by Equal-frequency binning. The approach presented here is generally applicable to other cancer types for classification and identification of molecular subgroups by integrating data across different gene expression platforms.
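The three discretization methods named in the abstract can be illustrated with a minimal, standard-library sketch. This is not the authors' implementation: the function names, the choice of three bins, the deterministic k-means initialization, and the toy values are all assumptions made for illustration. The demo at the end also shows one general property of rank-based (equal-frequency) binning: applied per feature, it is unchanged by any strictly increasing rescaling of the values, whereas equal-width binning generally is not.

```python
import random


def equal_width_bins(values, k):
    """Split [min, max] into k intervals of equal length; return a bin index per value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum would fall just past the last interval, so clamp it to bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bins by rank so that each bin holds roughly len(values) / k samples.

    Simplification: tied values are split by their original order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


def kmeans_bins(values, k, iterations=100):
    """One-dimensional Lloyd's algorithm; bin j is the cluster with the j-th smallest centre."""
    lo, hi = min(values), max(values)
    # Assumed deterministic start: centres spread evenly over the observed range.
    centres = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        new_centres = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # An empty cluster keeps its previous centre.
            new_centres.append(sum(members) / len(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return labels


random.seed(0)
# Hypothetical expression values for one isoform across 12 samples.
platform_a = [random.uniform(2.0, 12.0) for _ in range(12)]
# A strictly increasing transform of the same samples, standing in for a
# second platform's measurement scale (an assumption, not real data).
platform_b = [2.0 ** v for v in platform_a]

ef_a = equal_frequency_bins(platform_a, 3)
ef_b = equal_frequency_bins(platform_b, 3)
print("equal-frequency, scale A:", ef_a)
print("equal-frequency, scale B:", ef_b)
print("equal-width,     scale A:", equal_width_bins(platform_a, 3))
print("equal-width,     scale B:", equal_width_bins(platform_b, 3))
print("k-means,         scale A:", kmeans_bins(platform_a, 3))
```

In a pipeline like the one the abstract describes, a step of this kind would sit between feature selection and the classifier, applied to each selected feature separately.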
format Online
Article
Text
id pubmed-4652565
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4652565 2015-11-25 BMC Genomics Research BioMed Central 2015-11-10 /pmc/articles/PMC4652565/ /pubmed/26576613 http://dx.doi.org/10.1186/1471-2164-16-S11-S3 Text en Copyright © 2015 Jung et al. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping
topic Research