Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping
BACKGROUND: Many supervised learning algorithms have been applied in deriving gene signatures for patient stratification from gene expression data. However, transferring the multi-gene signatures from one analytical platform to another without loss of classification accuracy is a major challenge. He...
Main authors: | Jung, Segun; Bi, Yingtao; Davuluri, Ramana V |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2015 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4652565/ https://www.ncbi.nlm.nih.gov/pubmed/26576613 http://dx.doi.org/10.1186/1471-2164-16-S11-S3 |
_version_ | 1782401781735620608 |
---|---|
author | Jung, Segun; Bi, Yingtao; Davuluri, Ramana V |
author_facet | Jung, Segun; Bi, Yingtao; Davuluri, Ramana V |
author_sort | Jung, Segun |
collection | PubMed |
description | BACKGROUND: Many supervised learning algorithms have been applied in deriving gene signatures for patient stratification from gene expression data. However, transferring the multi-gene signatures from one analytical platform to another without loss of classification accuracy is a major challenge. Here, we compared three unsupervised data discretization methods (Equal-width binning, Equal-frequency binning, and k-means clustering) in accurately classifying the four known subtypes of glioblastoma multiforme (GBM) when the classification algorithms were trained on the isoform-level gene expression profiles from an exon-array platform and tested on the corresponding profiles from RNA-seq data. RESULTS: We applied an integrated machine learning framework that involves three sequential steps: feature selection, data discretization, and classification. For models trained and tested on exon-array data, the addition of a data discretization step led to robust and accurate predictive models with fewer variables in the final models. For models trained on exon-array data and tested on RNA-seq data, the addition of a data discretization step dramatically improved the classification accuracies, with Equal-frequency binning showing the highest improvement, with more than 90% accuracy for all the models with features chosen by Random Forest-based feature selection. Overall, the SVM classifier coupled with Equal-frequency binning achieved the best accuracy (> 95%). Without data discretization, however, only 73.6% accuracy was achieved at most. CONCLUSIONS: The classification algorithms, trained and tested on data from the same platform, yielded similar accuracies in predicting the four GBM subgroups. However, when dealing with cross-platform data, from exon-array to RNA-seq, the classifiers yielded stable models with the highest classification accuracies on data transformed by Equal-frequency binning. The approach presented here is generally applicable to other cancer types for classification and identification of molecular subgroups by integrating data across different gene expression platforms. |
format | Online Article Text |
id | pubmed-4652565 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4652565 2015-11-25 Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping Jung, Segun; Bi, Yingtao; Davuluri, Ramana V BMC Genomics Research BACKGROUND: Many supervised learning algorithms have been applied in deriving gene signatures for patient stratification from gene expression data. However, transferring the multi-gene signatures from one analytical platform to another without loss of classification accuracy is a major challenge. Here, we compared three unsupervised data discretization methods (Equal-width binning, Equal-frequency binning, and k-means clustering) in accurately classifying the four known subtypes of glioblastoma multiforme (GBM) when the classification algorithms were trained on the isoform-level gene expression profiles from an exon-array platform and tested on the corresponding profiles from RNA-seq data. RESULTS: We applied an integrated machine learning framework that involves three sequential steps: feature selection, data discretization, and classification. For models trained and tested on exon-array data, the addition of a data discretization step led to robust and accurate predictive models with fewer variables in the final models. For models trained on exon-array data and tested on RNA-seq data, the addition of a data discretization step dramatically improved the classification accuracies, with Equal-frequency binning showing the highest improvement, with more than 90% accuracy for all the models with features chosen by Random Forest-based feature selection. Overall, the SVM classifier coupled with Equal-frequency binning achieved the best accuracy (> 95%). Without data discretization, however, only 73.6% accuracy was achieved at most. CONCLUSIONS: The classification algorithms, trained and tested on data from the same platform, yielded similar accuracies in predicting the four GBM subgroups. However, when dealing with cross-platform data, from exon-array to RNA-seq, the classifiers yielded stable models with the highest classification accuracies on data transformed by Equal-frequency binning. The approach presented here is generally applicable to other cancer types for classification and identification of molecular subgroups by integrating data across different gene expression platforms. BioMed Central 2015-11-10 /pmc/articles/PMC4652565/ /pubmed/26576613 http://dx.doi.org/10.1186/1471-2164-16-S11-S3 Text en Copyright © 2015 Jung et al. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Jung, Segun Bi, Yingtao Davuluri, Ramana V Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping |
title | Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping |
title_full | Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping |
title_fullStr | Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping |
title_full_unstemmed | Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping |
title_short | Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping |
title_sort | evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4652565/ https://www.ncbi.nlm.nih.gov/pubmed/26576613 http://dx.doi.org/10.1186/1471-2164-16-S11-S3 |
work_keys_str_mv | AT jungsegun evaluationofdatadiscretizationmethodstoderiveplatformindependentisoformexpressionsignaturesformulticlasstumorsubtyping AT biyingtao evaluationofdatadiscretizationmethodstoderiveplatformindependentisoformexpressionsignaturesformulticlasstumorsubtyping AT davuluriramanav evaluationofdatadiscretizationmethodstoderiveplatformindependentisoformexpressionsignaturesformulticlasstumorsubtyping |
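
The three unsupervised discretization methods named in the abstract (Equal-width binning, Equal-frequency binning, and k-means clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of three bins, the deterministic k-means start, and the toy values are all assumptions made for illustration. The second toy vector is a hypothetical monotone rescaling of the first, standing in for the same feature measured on a different platform.

```python
def equal_width_bins(values, n_bins=3):
    """Split the observed range into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins=3):
    """Assign bins by rank so each bin holds roughly the same number of samples.

    Ties are broken by input position in this sketch, so equal values can
    land in neighbouring bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    codes = [0] * len(values)
    for rank, i in enumerate(order):
        codes[i] = rank * n_bins // len(values)
    return codes


def kmeans_bins(values, n_bins=3, n_iter=100):
    """One-dimensional k-means (Lloyd's algorithm); the cluster index is the bin.

    Assumption: centers start evenly spread over the observed range, which
    makes the sketch deterministic.
    """
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (k + 0.5) / n_bins for k in range(n_bins)]
    codes = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: nearest center for every value.
        codes = [min(range(n_bins), key=lambda k: abs(v - centers[k]))
                 for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(n_bins):
            members = [v for v, c in zip(values, codes) if c == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return codes


# Hypothetical toy data: one feature measured on six samples.
array_like = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # e.g. log-scale intensities
rnaseq_like = [2.0 ** v for v in array_like]  # same ranking, different scale

print(equal_width_bins(array_like), equal_width_bins(rnaseq_like))
print(equal_frequency_bins(array_like), equal_frequency_bins(rnaseq_like))
print(kmeans_bins(array_like))
```

In this toy case the equal-frequency codes are identical for both vectors because they depend only on sample ranks, while the equal-width codes shift when the scale changes. Whether that property accounts for the cross-platform accuracies reported in the abstract is an assumption here, not something the record states.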