
Assessment of data transformations for model-based clustering of RNA-Seq data

Quality control, global biases, normalization, and analysis methods for RNA-Seq data are quite different from those for microarray-based studies. The assumption of normality is reasonable for microarray-based gene expression data; however, RNA-Seq data tend to follow an over-dispersed Poisson or neg...


Bibliographic Details
Main Authors: Noel-MacDonnell, Janelle R., Usset, Joseph, Goode, Ellen L., Fridley, Brooke L.
Format: Online Article Text
Language: English
Published: Public Library of Science 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5828440/
https://www.ncbi.nlm.nih.gov/pubmed/29485993
http://dx.doi.org/10.1371/journal.pone.0191758
author Noel-MacDonnell, Janelle R.
Usset, Joseph
Goode, Ellen L.
Fridley, Brooke L.
collection PubMed
description Quality control, global biases, normalization, and analysis methods for RNA-Seq data are quite different from those for microarray-based studies. The assumption of normality is reasonable for microarray-based gene expression data; however, RNA-Seq data tend to follow an over-dispersed Poisson or negative binomial distribution. Little research has been done to assess how data transformations impact Gaussian model-based clustering with respect to clustering performance and accuracy in estimating the correct number of clusters in RNA-Seq data. In this article, we investigate Gaussian model-based clustering performance and accuracy in estimating the correct number of clusters by applying four data transformations (i.e., naïve, logarithmic, Blom, and variance stabilizing transformation) to simulated RNA-Seq data. To do so, an extensive simulation study was carried out in which the scenarios varied in terms of how genes were selected for inclusion in the clustering analyses, the size of the clusters, and the number of clusters. Following the application of the different transformations to the simulated data, Gaussian model-based clustering was carried out. To assess clustering performance for each of the data transformations, the adjusted Rand index, clustering error rate, and concordance index were used. As expected, our results showed that clustering performance improved in scenarios where data transformations were applied to make the data appear “more” Gaussian in distribution.
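Two of the transformations named in the abstract, and the adjusted Rand index used to score the resulting clusterings, can be sketched in a few lines. This is a minimal illustration and not the authors' code: the record does not give their implementation, so the function names are hypothetical, the log transform assumes base 2 with an offset of 1 to handle zero counts, and the Blom transform uses the standard rank-based formula with average ranks for ties. The variance stabilizing transformation and the Gaussian mixture fit itself are left out.

```python
from collections import Counter
from math import comb, log2
from statistics import NormalDist


def log_transform(counts, offset=1.0):
    """Return log2(count + offset). The offset of 1 is an assumption for zero counts."""
    return [log2(c + offset) for c in counts]


def blom_transform(values):
    """Rank-based inverse normal (Blom) scores: Phi^-1((r - 3/8) / (n + 1/4)).

    Tied values receive the average of the ranks they span.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same n >= 2 samples."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table counts
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # both partitions trivial; treat as perfect agreement
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

In a workflow like the one the abstract describes, each gene's counts would be transformed first, the samples clustered with a Gaussian mixture model, and the predicted labels compared with the simulated truth using `adjusted_rand_index`.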
format Online
Article
Text
id pubmed-5828440
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-58284402018-03-19 Assessment of data transformations for model-based clustering of RNA-Seq data Noel-MacDonnell, Janelle R. Usset, Joseph Goode, Ellen L. Fridley, Brooke L. PLoS One Research Article Quality control, global biases, normalization, and analysis methods for RNA-Seq data are quite different than those for microarray-based studies. The assumption of normality is reasonable for microarray based gene expression data; however, RNA-Seq data tend to follow an over-dispersed Poisson or negative binomial distribution. Little research has been done to assess how data transformations impact Gaussian model-based clustering with respect to clustering performance and accuracy in estimating the correct number of clusters in RNA-Seq data. In this article, we investigate Gaussian model-based clustering performance and accuracy in estimating the correct number of clusters by applying four data transformations (i.e., naïve, logarithmic, Blom, and variance stabilizing transformation) to simulated RNA-Seq data. To do so, an extensive simulation study was carried out in which the scenarios varied in terms of: how genes were selected to be included in the clustering analyses, size of the clusters, and number of clusters. Following the application of the different transformations to the simulated data, Gaussian model-based clustering was carried out. To assess clustering performance for each of the data transformations, the adjusted rand index, clustering error rate, and concordance index were utilized. As expected, our results showed that clustering performance was gained in scenarios where data transformations were applied to make the data appear “more” Gaussian in distribution. 
Public Library of Science 2018-02-27 /pmc/articles/PMC5828440/ /pubmed/29485993 http://dx.doi.org/10.1371/journal.pone.0191758 Text en © 2018 Noel-MacDonnell et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Assessment of data transformations for model-based clustering of RNA-Seq data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5828440/
https://www.ncbi.nlm.nih.gov/pubmed/29485993
http://dx.doi.org/10.1371/journal.pone.0191758