A comprehensive simulation study on classification of RNA-Seq data


Bibliographic Details
Main Authors: Zararsız, Gökmen; Goksuluk, Dincer; Korkmaz, Selcuk; Eldem, Vahap; Zararsiz, Gozde Erturk; Duru, Izzet Parug; Ozturk, Ahmet
Format: Online Article Text
Language: English
Published: Public Library of Science 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5568128/
https://www.ncbi.nlm.nih.gov/pubmed/28832679
http://dx.doi.org/10.1371/journal.pone.0182507
author Zararsız, Gökmen
Goksuluk, Dincer
Korkmaz, Selcuk
Eldem, Vahap
Zararsiz, Gozde Erturk
Duru, Izzet Parug
Ozturk, Ahmet
collection PubMed
description RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging and powerful approach for diagnosis, disease classification and monitoring at the molecular level, as well as for identifying potential disease markers. Most statistical methods proposed for the classification of gene-expression data either assume data on a continuous scale (e.g., microarray data) or require a normal distribution assumption. Hence, these methods cannot be applied directly to RNA-Seq data, because count data violate both their data-structure and distributional assumptions. However, the algorithms can be applied to RNA-Seq data with appropriate modifications. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers, including PLDA with and without power transformation, NBLDA, single support vector machines (SVM), bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters on model performance: overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method. A comprehensive simulation study was conducted, and its results were compared with those from two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size and differential-expression rate, and decreasing the dispersion parameter and number of groups, lead to an increase in classification accuracy. As in differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion. We conclude that, as a count-based classifier, the power-transformed PLDA and, as microarray-based classifiers, vst- or rlog-transformed RF and SVM may be good choices for classification. An R/Bioconductor package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html.
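The count-based approach named in the description can be illustrated with a rough sketch. The Python code below is a simplified Poisson linear discriminant analysis, assuming total-count size factors and no shrinkage of the class effects. It is not the authors' implementation and not the MLSeq API (MLSeq is an R package); the function names `simulate_counts`, `fit_plda`, and `predict_plda`, and all simulation settings, are hypothetical and chosen only for illustration. The power transformation mentioned in the description is omitted.

```python
import numpy as np


def simulate_counts(rng, n_per_class=30, n_genes=50, n_de=10,
                    fold=4.0, dispersion=0.1):
    """Simulate two-class negative binomial counts (hypothetical settings)."""
    base = rng.uniform(20, 100, size=n_genes)        # baseline gene means
    y = np.repeat([0, 1], n_per_class)               # class labels
    mu = np.tile(base, (2 * n_per_class, 1))
    mu[y == 1, :n_de] *= fold                        # differentially expressed genes
    n = 1.0 / dispersion                             # NB size parameter
    # With p = n / (n + mu): mean = mu, variance = mu + dispersion * mu**2
    X = rng.negative_binomial(n, n / (n + mu))
    return X, y


def fit_plda(X, y, beta=1.0):
    """Fit a simplified Poisson LDA (assumption: no soft-thresholding)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    total = X.sum()
    g = X.sum(axis=0)                                # gene totals g_j
    s = X.sum(axis=1) / total                        # size factors s_i
    d = np.empty((classes.size, X.shape[1]))
    for k, c in enumerate(classes):
        in_class = y == c
        observed = X[in_class].sum(axis=0)           # class-k counts per gene
        expected = s[in_class].sum() * g             # counts expected under no class effect
        d[k] = (observed + beta) / (expected + beta) # smoothed class effect d_kj
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    return {"classes": classes, "g": g, "d": d,
            "log_prior": log_prior, "total": total}


def predict_plda(model, X_new):
    """Assign each row of X_new to the class with the largest Poisson score."""
    X_new = np.asarray(X_new, dtype=float)
    s_new = X_new.sum(axis=1) / model["total"]       # size factor of each new sample
    # score_k = sum_j x_j * log(d_kj) - s * sum_j g_j * d_kj + log(prior_k)
    scores = (X_new @ np.log(model["d"]).T
              - np.outer(s_new, model["d"] @ model["g"])
              + model["log_prior"])
    return model["classes"][scores.argmax(axis=1)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = simulate_counts(rng, n_per_class=60)
    train = np.arange(y.size) % 2 == 0               # alternate samples: train / test
    model = fit_plda(X[train], y[train])
    accuracy = (predict_plda(model, X[~train]) == y[~train]).mean()
    print(f"held-out accuracy: {accuracy:.2f}")
```

Raising the `dispersion` argument or lowering `fold` in `simulate_counts` gives a rough feel for the parameters the study varies, though this toy setup does not reproduce the paper's simulation design.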
format Online
Article
Text
id pubmed-5568128
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5568128 2017-09-09 A comprehensive simulation study on classification of RNA-Seq data PLoS One Research Article Public Library of Science 2017-08-23 /pmc/articles/PMC5568128/ /pubmed/28832679 http://dx.doi.org/10.1371/journal.pone.0182507 Text en © 2017 Zararsız et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title A comprehensive simulation study on classification of RNA-Seq data
topic Research Article