Performance evaluation of lossy quality compression algorithms for RNA-seq data
BACKGROUND: Recent advancements in high-throughput sequencing technologies have generated an unprecedented amount of genomic data that must be stored, processed, and transmitted over the network for sharing. Lossy genomic data compression, especially of the base quality values of sequencing data, is...
Main authors: | Yu, Rongshan; Yang, Wenxian; Wang, Shun |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2020 |
Subjects: | Methodology Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7372835/ https://www.ncbi.nlm.nih.gov/pubmed/32689929 http://dx.doi.org/10.1186/s12859-020-03658-4 |
_version_ | 1783561391640150016 |
---|---|
author | Yu, Rongshan; Yang, Wenxian; Wang, Shun |
author_facet | Yu, Rongshan; Yang, Wenxian; Wang, Shun |
author_sort | Yu, Rongshan |
collection | PubMed |
description | BACKGROUND: Recent advancements in high-throughput sequencing technologies have generated an unprecedented amount of genomic data that must be stored, processed, and transmitted over the network for sharing. Lossy genomic data compression, especially of the base quality values of sequencing data, is emerging as an efficient way to handle this challenge due to its superior compression performance compared to lossless compression methods. Many lossy compression algorithms have been developed for and evaluated using DNA sequencing data. However, whether these algorithms can be used on RNA sequencing (RNA-seq) data remains unclear. RESULTS: In this study, we evaluated the impact of lossy quality value compression on common RNA-seq data analysis pipelines, including expression quantification, transcriptome assembly, and short variant detection, using RNA-seq data from different species and sequencing platforms. Our study shows that lossy quality value compression could effectively improve RNA-seq data compression. In some cases, lossy algorithms achieved up to 1.2-3 times further reduction in the overall RNA-seq data size compared to existing lossless algorithms. However, lossy quality value compression could affect the results of some RNA-seq data processing pipelines, and hence its impact on RNA-seq studies cannot be ignored in some cases. Pipelines using HISAT2 for alignment were most significantly affected by lossy quality value compression, while no effects of lossy compression were observed on pipelines that do not depend on quality values, e.g., STAR-based expression quantification and transcriptome assembly pipelines. Moreover, regardless of whether STAR or HISAT2 was used as the aligner, variant detection results were affected by lossy quality value compression, albeit to a lesser extent when a STAR-based pipeline was used. Our results also show that the impact of lossy quality value compression depends on the compression algorithm used and, if the algorithm supports multiple compression levels, on the level chosen. CONCLUSIONS: Lossy quality value compression can be incorporated into existing RNA-seq analysis pipelines to alleviate data storage and transmission burdens. However, care should be taken in selecting compression tools and levels based on the requirements of the downstream analysis pipelines to avoid introducing adverse effects on the analysis results. |
format | Online Article Text |
id | pubmed-7372835 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7372835 2020-07-21 Performance evaluation of lossy quality compression algorithms for RNA-seq data Yu, Rongshan Yang, Wenxian Wang, Shun BMC Bioinformatics Methodology Article BACKGROUND: Recent advancements in high-throughput sequencing technologies have generated an unprecedented amount of genomic data that must be stored, processed, and transmitted over the network for sharing. Lossy genomic data compression, especially of the base quality values of sequencing data, is emerging as an efficient way to handle this challenge due to its superior compression performance compared to lossless compression methods. Many lossy compression algorithms have been developed for and evaluated using DNA sequencing data. However, whether these algorithms can be used on RNA sequencing (RNA-seq) data remains unclear. RESULTS: In this study, we evaluated the impact of lossy quality value compression on common RNA-seq data analysis pipelines, including expression quantification, transcriptome assembly, and short variant detection, using RNA-seq data from different species and sequencing platforms. Our study shows that lossy quality value compression could effectively improve RNA-seq data compression. In some cases, lossy algorithms achieved up to 1.2-3 times further reduction in the overall RNA-seq data size compared to existing lossless algorithms. However, lossy quality value compression could affect the results of some RNA-seq data processing pipelines, and hence its impact on RNA-seq studies cannot be ignored in some cases. Pipelines using HISAT2 for alignment were most significantly affected by lossy quality value compression, while no effects of lossy compression were observed on pipelines that do not depend on quality values, e.g., STAR-based expression quantification and transcriptome assembly pipelines. Moreover, regardless of whether STAR or HISAT2 was used as the aligner, variant detection results were affected by lossy quality value compression, albeit to a lesser extent when a STAR-based pipeline was used. Our results also show that the impact of lossy quality value compression depends on the compression algorithm used and, if the algorithm supports multiple compression levels, on the level chosen. CONCLUSIONS: Lossy quality value compression can be incorporated into existing RNA-seq analysis pipelines to alleviate data storage and transmission burdens. However, care should be taken in selecting compression tools and levels based on the requirements of the downstream analysis pipelines to avoid introducing adverse effects on the analysis results. BioMed Central 2020-07-20 /pmc/articles/PMC7372835/ /pubmed/32689929 http://dx.doi.org/10.1186/s12859-020-03658-4 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Methodology Article Yu, Rongshan Yang, Wenxian Wang, Shun Performance evaluation of lossy quality compression algorithms for RNA-seq data |
title | Performance evaluation of lossy quality compression algorithms for RNA-seq data |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7372835/ https://www.ncbi.nlm.nih.gov/pubmed/32689929 http://dx.doi.org/10.1186/s12859-020-03658-4 |