
Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm

BACKGROUND: Gene signatures represent the molecular changes in disease genomes or in cells under specific conditions, and have often been used to separate samples into groups for research or clinical treatment. While many methods and applications have been availab...


Bibliographic Details
Main Authors: Mallik, Saurav, Zhao, Zhongming
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302366/
https://www.ncbi.nlm.nih.gov/pubmed/30577846
http://dx.doi.org/10.1186/s12918-018-0650-2
author Mallik, Saurav
Zhao, Zhongming
collection PubMed
description BACKGROUND: Gene signatures represent the molecular changes in disease genomes or in cells under specific conditions, and have often been used to separate samples into groups for research or clinical treatment. Although many methods and applications are available in the literature, powerful ones that can account for complex data and detect the most informative signatures are still lacking. METHODS: In this article, we present a new framework for identifying gene signatures in RNA-seq data using Pareto-optimal cluster size identification. We first performed pre-filtering and normalization, then used the empirical Bayes test in the Limma package to identify differentially expressed genes (DEGs). Next, we applied a multi-objective optimization technique, “Multi-objective optimization for collecting cluster alternatives” (the MOCCA R package), to these DEGs to find the Pareto-optimal cluster size, and then applied k-means clustering to the RNA-seq data using that cluster size. The best cluster was selected by computing the average pairwise Spearman’s correlation score among all genes belonging to each module. The best cluster is treated as the signature for the respective disease or cellular condition. RESULTS: We applied our framework to a cervical cancer RNA-seq dataset that included 253 squamous cell carcinoma (SCC) samples and 22 adenocarcinoma (ADENO) samples. Limma analysis of SCC versus ADENO samples identified 582 DEGs, of which 260 were up-regulated and 322 were down-regulated. Using MOCCA, we obtained seven Pareto-optimal clusters. The best cluster contained 35 DEGs, all of them up-regulated. For validation, we ran the PAMR (prediction analysis for microarrays) classifier on the selected best cluster and assessed its classification performance. Our evaluation, measured by sensitivity, specificity, precision, and accuracy, showed high confidence. CONCLUSIONS: Our framework identified a multi-objective based cluster that is treated as a signature and that can classify the disease and control groups of samples with high classification performance (accuracy 0.935) for the corresponding disease. Our method is useful for finding signatures in any RNA-seq or microarray data.
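The cluster-selection step named in the description (average pairwise Spearman correlation among the genes of each module) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which the record describes as using R packages; the function names, the toy expression values, the choice of the highest raw (signed) average as "best", and the handling of constant profiles are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant profile counts as uncorrelated
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(profiles):
    """Average Spearman correlation over all gene pairs in one cluster."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0  # assumption: a single-gene cluster scores zero
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


def best_cluster(clusters):
    """Return the name of the cluster with the highest average correlation."""
    return max(clusters, key=lambda name: mean_pairwise_spearman(clusters[name]))


# Hypothetical toy data: each cluster maps to gene expression profiles
# (one list per gene, one value per sample).
toy_clusters = {
    "cluster_a": [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 9.0], [0.5, 0.7, 1.1, 1.2]],
    "cluster_b": [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [1.0, 3.0, 2.0, 4.0]],
}
selected = best_cluster(toy_clusters)
```

In this toy input every gene in `cluster_a` rises monotonically across samples, so it is the one selected. Whether the original framework ranks clusters by signed or absolute correlation is not stated in this record.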
format Online
Article
Text
id pubmed-6302366
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6302366 2018-12-31 Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm Mallik, Saurav Zhao, Zhongming BMC Syst Biol Research BioMed Central 2018-12-21 /pmc/articles/PMC6302366/ /pubmed/30577846 http://dx.doi.org/10.1186/s12918-018-0650-2 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm
topic Research