
FastqPuri: high-performance preprocessing of RNA-seq data

BACKGROUND: RNA sequencing (RNA-seq) has become the standard means of analyzing gene and transcript expression in high-throughput. While previously sequence alignment was a time demanding step, fast alignment methods and even more so transcript counting methods which avoid mapping and quantify gene...


Bibliographic Details
Main Authors:	Pérez-Rubio, Paula; Lottaz, Claudio; Engelmann, Julia C.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Software
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6500068/ https://www.ncbi.nlm.nih.gov/pubmed/31053060 http://dx.doi.org/10.1186/s12859-019-2799-0

author	Pérez-Rubio, Paula Lottaz, Claudio Engelmann, Julia C.
collection	PubMed
description	BACKGROUND: RNA sequencing (RNA-seq) has become the standard means of analyzing gene and transcript expression at high throughput. Sequence alignment was previously a time-consuming step, but fast alignment methods, and even more so transcript counting methods that avoid mapping and quantify gene and transcript expression by evaluating whether a read is compatible with a transcript, have led to significant speed-ups in data analysis. The most time-consuming step in RNA-seq analysis is now preprocessing the raw sequence data: running quality control and filtering for adapters, contamination and quality before transcript or gene quantification. To do so, many researchers chain different tools, but a comprehensive, flexible and fast software tool that covers all preprocessing steps is currently missing. RESULTS: We present FastqPuri, a light-weight and highly efficient preprocessing tool for fastq data. FastqPuri provides sequence quality reports at the sample and dataset level, with new plots that facilitate decision making for subsequent quality filtering. Moreover, FastqPuri efficiently removes adapter sequences and biological contaminant sequences from the data. It accepts both single- and paired-end data in uncompressed or compressed fastq files. FastqPuri can be run stand-alone or within pipelines. We benchmarked FastqPuri against existing tools and found that FastqPuri is superior in terms of speed, memory usage, versatility and comprehensiveness. CONCLUSIONS: FastqPuri is a new tool that covers all aspects of short-read sequence data preprocessing. It was designed for RNA-seq data, to meet the need for fast preprocessing of fastq data ahead of transcript and gene counting, but it is suitable for any short-read sequencing data for which high sequence quality is needed, such as for genome assembly or SNV (single nucleotide variant) detection. FastqPuri filters undesired biological sequences flexibly, offering two approaches that optimize speed and memory usage depending on the total size of the potential contaminating sequences. FastqPuri is available at https://github.com/jengelmann/FastqPuri. It is implemented in C and R and licensed under GPL v3. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-2799-0) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6500068
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6500068 2019-05-09 FastqPuri: high-performance preprocessing of RNA-seq data Pérez-Rubio, Paula Lottaz, Claudio Engelmann, Julia C. BMC Bioinformatics Software BioMed Central 2019-05-03 /pmc/articles/PMC6500068/ /pubmed/31053060 http://dx.doi.org/10.1186/s12859-019-2799-0 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	FastqPuri: high-performance preprocessing of RNA-seq data
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6500068/ https://www.ncbi.nlm.nih.gov/pubmed/31053060 http://dx.doi.org/10.1186/s12859-019-2799-0
