Differential gene expression analysis tools exhibit substandard performance for long non-coding RNA-sequencing data

BACKGROUND: Long non-coding RNAs (lncRNAs) are typically expressed at low levels and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 25 pipelines for testing DE in RNA-seq data is comprehensively evaluated,...

Bibliographic Details
Main authors: Assefa, Alemu Takele; De Paepe, Katrijn; Everaert, Celine; Mestdagh, Pieter; Thas, Olivier; Vandesompele, Jo
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6058388/
https://www.ncbi.nlm.nih.gov/pubmed/30041657
http://dx.doi.org/10.1186/s13059-018-1466-5
_version_ 1783341683994263552
author Assefa, Alemu Takele
De Paepe, Katrijn
Everaert, Celine
Mestdagh, Pieter
Thas, Olivier
Vandesompele, Jo
author_sort Assefa, Alemu Takele
collection PubMed
description BACKGROUND: Long non-coding RNAs (lncRNAs) are typically expressed at low levels and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 25 pipelines for testing DE in RNA-seq data is comprehensively evaluated, with a particular focus on lncRNAs and low-abundance mRNAs. Fifteen performance metrics are used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets. RESULTS: Gene expression data are simulated using non-parametric procedures in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, results for mRNA and lncRNA were tracked separately. All the pipelines exhibit inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and benchmark RNA-seq datasets. The substandard performance of DE tools for lncRNAs applies also to low-abundance mRNAs. No single tool uniformly outperformed the others. Variability, number of samples, and fraction of DE genes markedly influenced DE tool performance. CONCLUSIONS: Overall, linear modeling with empirical Bayes moderation (limma) and a non-parametric approach (SAMSeq) showed good control of the false discovery rate and reasonable sensitivity. Of note, for achieving a sensitivity of at least 50%, more than 80 samples are required when studying expression levels in realistic settings such as in clinical cancer research. About half of the methods showed a substantial excess of false discoveries, making these methods unreliable for DE analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, giving guidance on selection of the optimal DE tool (http://statapps.ugent.be/tools/AppDGE/). 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13059-018-1466-5) contains supplementary material, which is available to authorized users.
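The abstract above judges DE tools mainly by false discovery rate (FDR) and sensitivity, tracked separately for mRNAs and lncRNAs. As a minimal sketch of how these two metrics can be computed when the truly DE genes are known (as in a simulation), the following Python snippet may help. The function name `fdr_and_sensitivity`, the gene IDs, and the toy call sets are hypothetical illustrations and are not taken from the study or its software.

```python
def fdr_and_sensitivity(called_de, truly_de):
    """Return (observed FDR, sensitivity) for one set of DE calls.

    called_de: gene IDs a tool reported as differentially expressed.
    truly_de:  gene IDs known (e.g. by simulation design) to be truly DE.
    """
    called_de, truly_de = set(called_de), set(truly_de)
    true_pos = len(called_de & truly_de)
    false_pos = len(called_de - truly_de)
    # FDR: share of reported genes that are false discoveries (0 if nothing was called).
    fdr = false_pos / len(called_de) if called_de else 0.0
    # Sensitivity: share of truly DE genes that the tool recovered.
    sensitivity = true_pos / len(truly_de) if truly_de else 0.0
    return fdr, sensitivity


# Hypothetical toy data (invented for illustration), split by biotype.
truth = {"mRNA": {"m1", "m2", "m3", "m4"}, "lncRNA": {"l1", "l2", "l3", "l4"}}
calls = {"mRNA": {"m1", "m2", "m3", "m9"}, "lncRNA": {"l1", "l8", "l9"}}

for biotype in truth:
    fdr, sens = fdr_and_sensitivity(calls[biotype], truth[biotype])
    print(f"{biotype}: FDR={fdr:.2f}, sensitivity={sens:.2f}")
```

Reporting the metrics per biotype, rather than pooled over all genes, is what allows a gap between lncRNA and mRNA performance to show up at all.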
format Online
Article
Text
id pubmed-6058388
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6058388 2018-07-30 Differential gene expression analysis tools exhibit substandard performance for long non-coding RNA-sequencing data Assefa, Alemu Takele De Paepe, Katrijn Everaert, Celine Mestdagh, Pieter Thas, Olivier Vandesompele, Jo Genome Biol Research
BioMed Central 2018-07-24 /pmc/articles/PMC6058388/ /pubmed/30041657 http://dx.doi.org/10.1186/s13059-018-1466-5 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Differential gene expression analysis tools exhibit substandard performance for long non-coding RNA-sequencing data
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6058388/
https://www.ncbi.nlm.nih.gov/pubmed/30041657
http://dx.doi.org/10.1186/s13059-018-1466-5