Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm
BACKGROUND: Gene signatures are important to represent the molecular changes in the disease genomes or the cells in specific conditions, and have been often used to separate samples into different groups for better research or clinical treatment. While many methods and applications have been availab...
Main Authors: | Mallik, Saurav; Zhao, Zhongming |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2018 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302366/ https://www.ncbi.nlm.nih.gov/pubmed/30577846 http://dx.doi.org/10.1186/s12918-018-0650-2 |
_version_ | 1783381962112630784 |
---|---|
author | Mallik, Saurav Zhao, Zhongming |
author_facet | Mallik, Saurav Zhao, Zhongming |
author_sort | Mallik, Saurav |
collection | PubMed |
description | BACKGROUND: Gene signatures represent the molecular changes in disease genomes or in cells under specific conditions, and are often used to separate samples into groups for research or clinical treatment. Although many methods and applications are available in the literature, powerful ones that can account for complex data and detect the most informative signatures are still lacking. METHODS: In this article, we present a new framework for identifying gene signatures from RNA-seq data using Pareto-optimal cluster size identification. We first performed pre-filtering and normalization, then used the empirical Bayes test in the Limma package to identify the differentially expressed genes (DEGs). Next, we applied a multi-objective optimization technique, “Multi-objective optimization for collecting cluster alternatives” (the MOCCA R package), to these DEGs to find the Pareto-optimal cluster size, and then applied k-means clustering to the RNA-seq data using that cluster size. The best cluster was obtained by computing the average Spearman’s correlation score over all pairs of genes belonging to the module. The best cluster is treated as the signature for the respective disease or cellular condition. RESULTS: We applied our framework to a cervical cancer RNA-seq dataset that included 253 squamous cell carcinoma (SCC) samples and 22 adenocarcinoma (ADENO) samples. Limma analysis of SCC versus ADENO samples identified a total of 582 DEGs, of which 260 were up-regulated and 322 were down-regulated. Using MOCCA, we obtained seven Pareto-optimal clusters. The best cluster contained 35 DEGs, all of them up-regulated. For validation, we ran the PAMR (prediction analysis for microarrays) classifier on the selected best cluster and assessed its classification performance. Our evaluation, measured by sensitivity, specificity, precision, and accuracy, showed high confidence. CONCLUSIONS: Our framework identified a multi-objective-based cluster that serves as a signature and can classify the disease and control groups of samples with high classification performance (accuracy 0.935) for the corresponding disease. Our method can be used to find signatures in any RNA-seq or microarray data. |
format | Online Article Text |
id | pubmed-6302366 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6302366 2018-12-31 Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm Mallik, Saurav Zhao, Zhongming BMC Syst Biol Research BioMed Central 2018-12-21 /pmc/articles/PMC6302366/ /pubmed/30577846 http://dx.doi.org/10.1186/s12918-018-0650-2 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Mallik, Saurav Zhao, Zhongming Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm |
title | Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm |
title_full | Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm |
title_fullStr | Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm |
title_full_unstemmed | Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm |
title_short | Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm |
title_sort | identification of gene signatures from rna-seq data using pareto-optimal cluster algorithm |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302366/ https://www.ncbi.nlm.nih.gov/pubmed/30577846 http://dx.doi.org/10.1186/s12918-018-0650-2 |
work_keys_str_mv | AT malliksaurav identificationofgenesignaturesfromrnaseqdatausingparetooptimalclusteralgorithm AT zhaozhongming identificationofgenesignaturesfromrnaseqdatausingparetooptimalclusteralgorithm |
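
The description above says the best cluster is chosen by averaging Spearman's correlation over all gene pairs within a module. The following is a minimal Python sketch of that one step only, not the authors' implementation (the record states the original work used R packages). It assumes cluster labels have already been produced elsewhere (for example by k-means), that each gene has an expression profile across the same samples, and that the cluster with the highest average score is the "best" one. All function names, the toy expression values, and the handling of constant profiles and single-gene clusters are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of equal values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant profile contributes zero
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(profiles):
    """Average Spearman correlation over all unordered gene pairs."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0  # assumption: a single-gene cluster has nothing to score
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


def best_cluster(expression, labels):
    """Score every cluster and return (id of highest-scoring cluster, all scores).

    expression: dict mapping gene name -> list of expression values
    labels:     dict mapping gene name -> cluster id
    """
    clusters = {}
    for gene, cluster_id in labels.items():
        clusters.setdefault(cluster_id, []).append(expression[gene])
    scores = {cid: mean_pairwise_spearman(p) for cid, p in clusters.items()}
    return max(scores, key=scores.get), scores


# Toy data (hypothetical values, not taken from the article).
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 5.0, 9.0],
    "g3": [0.5, 0.7, 1.1, 1.3],
    "g4": [4.0, 1.0, 3.0, 2.0],
    "g5": [1.0, 3.0, 2.0, 4.0],
}
labels = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
winner, scores = best_cluster(expression, labels)
print(winner, scores)
```

In this toy input the three genes in cluster 0 rise together across samples, so that cluster scores higher than cluster 1, whose two genes are anti-correlated. The sketch uses only the standard library; with real data a tested routine such as SciPy's `spearmanr` would be a reasonable substitute for the hand-written ranking.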