
S-leaping: an efficient downsampling method for large high-throughput sequencing data

MOTIVATION: Sequencing coverage is among key determinants considered in the design of omics studies. To help estimate cost-effective sequencing coverage for specific downstream analysis, downsampling, a technique to sample subsets of reads with a specific size, is routinely used. However, as the siz...


Bibliographic Details
Main Authors:	Kuwahara, Hiroyuki; Gao, Xin
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10318387/ https://www.ncbi.nlm.nih.gov/pubmed/37354496 http://dx.doi.org/10.1093/bioinformatics/btad399

collection	PubMed
description	MOTIVATION: Sequencing coverage is among the key determinants considered in the design of omics studies. To help estimate cost-effective sequencing coverage for specific downstream analysis, downsampling, a technique to sample subsets of reads with a specific size, is routinely used. However, as the size of sequencing data grows, downsampling becomes computationally challenging. RESULTS: Here, we developed an approximate downsampling method called s-leaping that was designed to efficiently and accurately process large-size data. We compared the performance of s-leaping with state-of-the-art downsampling methods in a range of practical omics-study downsampling settings and found s-leaping to be up to 39% faster than the second-fastest method, with comparable accuracy to the exact downsampling methods. To apply s-leaping on FASTQ data, we developed a lightweight tool called fadso in C. Using whole-genome sequencing data with 208 million reads, we compared fadso’s performance with that of a commonly used FASTQ tool with the same downsampling feature and found fadso to be up to 12% faster with 21% lower memory usage, suggesting fadso to have up to 40% higher throughput in a parallel computing setting. AVAILABILITY AND IMPLEMENTATION: The C source code for s-leaping, as well as the fadso package, is freely available at https://github.com/hkuwahara/sleaping.
format	Online Article Text
id	pubmed-10318387
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10318387 2023-07-05 Bioinformatics Original Paper Oxford University Press 2023-06-24 /pmc/articles/PMC10318387/ /pubmed/37354496 http://dx.doi.org/10.1093/bioinformatics/btad399 Text en © The Author(s) 2023. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	S-leaping: an efficient downsampling method for large high-throughput sequencing data
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10318387/ https://www.ncbi.nlm.nih.gov/pubmed/37354496 http://dx.doi.org/10.1093/bioinformatics/btad399
