BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale

BACKGROUND: High-throughput sequencing technologies have led to an unprecedented explosion in the amounts of sequencing data available, which are typically stored using FASTA and FASTQ files. We can find in the literature several tools to process and manipulate those types of files with the aim of tr...

Bibliographic Details
Main Authors:	Piñeiro, César; Pichel, Juan C
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Technical Note
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10388699/ https://www.ncbi.nlm.nih.gov/pubmed/37522758 http://dx.doi.org/10.1093/gigascience/giad062

author	Piñeiro, César; Pichel, Juan C
collection	PubMed
description	BACKGROUND: High-throughput sequencing technologies have led to an unprecedented explosion in the amounts of sequencing data available, which are typically stored using FASTA and FASTQ files. We can find in the literature several tools to process and manipulate those types of files with the aim of transforming sequence data into biological knowledge. However, none of them is well suited to efficiently processing very large files, which are likely to reach the order of terabytes in the coming years, since they are based on sequential processing. Only some routines of the well-known seqkit tool are partly parallelized, and even then its scalability is limited to a few threads on a single computing node. RESULTS: Our approach, BigSeqKit, takes advantage of a high-performance computing–Big Data framework to parallelize and optimize the commands included in seqkit with the aim of speeding up the manipulation of FASTA/FASTQ files. As a result, in most cases, it is from tens to hundreds of times faster than several state-of-the-art tools. At the same time, our toolkit is easy to use and install on any kind of hardware platform (local server or cluster), and its routines can be used as a bioinformatics library or from the command line. CONCLUSIONS: BigSeqKit is a very complete and ultra-fast toolkit to process and manipulate large FASTA and FASTQ files. It is publicly available at https://github.com/citiususc/BigSeqKit.
format	Online Article Text
id	pubmed-10388699
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10388699 2023-08-01 Gigascience Technical Note Oxford University Press 2023-07-31 /pmc/articles/PMC10388699/ /pubmed/37522758 http://dx.doi.org/10.1093/gigascience/giad062 Text en © The Author(s) 2023. Published by Oxford University Press GigaScience. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale
topic	Technical Note
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10388699/ https://www.ncbi.nlm.nih.gov/pubmed/37522758 http://dx.doi.org/10.1093/gigascience/giad062