Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data

Bibliographic Details
Main authors: Paulson, Joseph N., Chen, Cho-Yi, Lopes-Ramos, Camila M., Kuijjer, Marieke L., Platig, John, Sonawane, Abhijeet R., Fagny, Maud, Glass, Kimberly, Quackenbush, John
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5627434/
https://www.ncbi.nlm.nih.gov/pubmed/28974199
http://dx.doi.org/10.1186/s12859-017-1847-x
author Paulson, Joseph N.
Chen, Cho-Yi
Lopes-Ramos, Camila M.
Kuijjer, Marieke L.
Platig, John
Sonawane, Abhijeet R.
Fagny, Maud
Glass, Kimberly
Quackenbush, John
author_sort Paulson, Joseph N.
collection PubMed
description BACKGROUND: Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data – critical first steps for any subsequent analysis. RESULTS: We find that analysis of large RNA-Seq data sets requires both careful quality control and accounting for the sparsity that arises from the heterogeneity intrinsic to multi-group studies. We developed the Yet Another RNA Normalization software pipeline (YARN), which includes quality control and preprocessing, gene filtering, and normalization steps designed to facilitate downstream analysis of large, heterogeneous RNA-Seq data sets, and we demonstrate its use with data from the Genotype-Tissue Expression (GTEx) project. CONCLUSIONS: An R package instantiating YARN is available at http://bioconductor.org/packages/yarn. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-017-1847-x) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5627434
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5627434 2017-10-12 Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data Paulson, Joseph N. Chen, Cho-Yi Lopes-Ramos, Camila M. Kuijjer, Marieke L. Platig, John Sonawane, Abhijeet R. Fagny, Maud Glass, Kimberly Quackenbush, John BMC Bioinformatics Software BioMed Central 2017-10-03 /pmc/articles/PMC5627434/ /pubmed/28974199 http://dx.doi.org/10.1186/s12859-017-1847-x Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data
topic Software