
Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm

BACKGROUND: Gene signatures are important to represent the molecular changes in the disease genomes or the cells in specific conditions, and have been often used to separate samples into different groups for better research or clinical treatment. While many methods and applications have been availab...


Bibliographic Details
Main authors:	Mallik, Saurav, Zhao, Zhongming
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302366/ https://www.ncbi.nlm.nih.gov/pubmed/30577846 http://dx.doi.org/10.1186/s12918-018-0650-2

_version_	1783381962112630784
author	Mallik, Saurav Zhao, Zhongming
author_facet	Mallik, Saurav Zhao, Zhongming
author_sort	Mallik, Saurav
collection	PubMed
description	BACKGROUND: Gene signatures represent the molecular changes in disease genomes or in cells under specific conditions, and are often used to separate samples into groups for research or clinical treatment. Although many methods and applications are available in the literature, powerful ones that can account for complex data and detect the most informative signatures are still lacking. METHODS: In this article, we present a new framework for identifying gene signatures from RNA-seq data using Pareto-optimal cluster size identification. We first performed pre-filtering and normalization, then used the empirical Bayes test in the Limma package to identify differentially expressed genes (DEGs). Next, we applied a multi-objective optimization technique, “Multi-objective optimization for collecting cluster alternatives” (the MOCCA R package), to these DEGs to find the Pareto-optimal cluster size, and then applied k-means clustering to the RNA-seq data using that cluster size. The best cluster was selected by computing the average Spearman’s correlation score over all pairs of genes belonging to each module. The best cluster is treated as the signature for the respective disease or cellular condition. RESULTS: We applied our framework to a cervical cancer RNA-seq dataset that included 253 squamous cell carcinoma (SCC) samples and 22 adenocarcinoma (ADENO) samples. Limma analysis of SCC versus ADENO samples identified 582 DEGs, of which 260 are up-regulated and 322 are down-regulated. Using MOCCA, we obtained seven Pareto-optimal clusters. The best cluster contains 35 DEGs, all of them up-regulated. For validation, we ran the PAMR (prediction analysis for microarrays) classifier on the selected best cluster and assessed its classification performance. Our evaluation, measured by sensitivity, specificity, precision, and accuracy, showed high confidence. CONCLUSIONS: Our framework identified a multi-objective based cluster, treated as a signature, that classifies the disease and control groups of samples with high classification performance (accuracy 0.935) for the corresponding disease. Our method is useful for finding signatures in any RNA-seq or microarray data.
format	Online Article Text
id	pubmed-6302366
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6302366 2018-12-31 Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm Mallik, Saurav Zhao, Zhongming BMC Syst Biol Research BioMed Central 2018-12-21 /pmc/articles/PMC6302366/ /pubmed/30577846 http://dx.doi.org/10.1186/s12918-018-0650-2 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Mallik, Saurav Zhao, Zhongming Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm
title	Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302366/ https://www.ncbi.nlm.nih.gov/pubmed/30577846 http://dx.doi.org/10.1186/s12918-018-0650-2
work_keys_str_mv	AT malliksaurav identificationofgenesignaturesfromrnaseqdatausingparetooptimalclusteralgorithm AT zhaozhongming identificationofgenesignaturesfromrnaseqdatausingparetooptimalclusteralgorithm
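
The cluster-selection step described in the abstract (scoring each cluster by the average pairwise Spearman correlation of its genes) can be sketched as follows. This is a minimal illustration and not the authors' code: the gene names, expression values, and cluster labels are made-up toy data, the function names are hypothetical, and picking the cluster with the highest average is an assumption, since the abstract does not state the selection rule explicitly. The Limma, MOCCA, and k-means steps are not reproduced here; cluster labels are taken as given.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant profile counts as uncorrelated
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def average_pairwise_spearman(profiles):
    """Mean Spearman correlation over all gene pairs in one cluster."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0  # assumption: a single-gene cluster scores zero
    return mean(spearman(a, b) for a, b in pairs)


def best_cluster(expression, labels):
    """Return (label of the highest-scoring cluster, all cluster scores)."""
    clusters = {}
    for gene, label in labels.items():
        clusters.setdefault(label, []).append(expression[gene])
    scores = {lab: average_pairwise_spearman(p) for lab, p in clusters.items()}
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): four genes measured across five samples.
expression = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 5, 8, 9],
    "g3": [5, 3, 4, 1, 2],
    "g4": [1, 5, 2, 4, 3],
}
labels = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}  # e.g. output of k-means

best, scores = best_cluster(expression, labels)
print(best, scores)
```

Working on ranks rather than raw values keeps the score insensitive to monotone transformations of the expression data, such as a log transform.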

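
The four evaluation measures named in the abstract (sensitivity, specificity, precision, accuracy) follow from the counts of a two-class confusion matrix. The sketch below uses their standard definitions; the function name and the example counts are hypothetical and are not taken from the paper's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard two-class metrics from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Assumption: report 0.0 when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "precision": ratio(tp, tp + fp),    # positive predictive value
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Illustrative counts only (not from the study).
metrics = classification_metrics(tp=8, fp=3, tn=7, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With strongly imbalanced groups, such as the 253 SCC versus 22 ADENO samples mentioned in the abstract, accuracy alone can look high even when the smaller class is classified poorly, which is one reason to report all four measures.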