
SPRISS: approximating frequent k-mers by sampling reads, and applications

MOTIVATION: The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in...

Bibliographic Details
Main Authors:	Santoro, Diego; Pellegrina, Leonardo; Comin, Matteo; Vandin, Fabio
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9237683/ https://www.ncbi.nlm.nih.gov/pubmed/35583271 http://dx.doi.org/10.1093/bioinformatics/btac180

_version_	1784736855007166464
author	Santoro, Diego Pellegrina, Leonardo Comin, Matteo Vandin, Fabio
author_facet	Santoro, Diego Pellegrina, Leonardo Comin, Matteo Vandin, Fabio
author_sort	Santoro, Diego
collection	PubMed
description	MOTIVATION: The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis. RESULTS: In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful reads sampling scheme, which allows to extract a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required by the analysis of the whole dataset. AVAILABILITY AND IMPLEMENTATION: SPRISS [a preliminary version (Santoro et al., 2021) of this work was presented at RECOMB 2021] is available at https://github.com/VandinLab/SPRISS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-9237683
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9237683 2022-06-29 SPRISS: approximating frequent k-mers by sampling reads, and applications Santoro, Diego Pellegrina, Leonardo Comin, Matteo Vandin, Fabio Bioinformatics Original Papers MOTIVATION: The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis. RESULTS: In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful reads sampling scheme, which allows to extract a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required by the analysis of the whole dataset. AVAILABILITY AND IMPLEMENTATION: SPRISS [a preliminary version (Santoro et al., 2021) of this work was presented at RECOMB 2021] is available at https://github.com/VandinLab/SPRISS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Oxford University Press 2022-05-18 /pmc/articles/PMC9237683/ /pubmed/35583271 http://dx.doi.org/10.1093/bioinformatics/btac180 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle	Original Papers Santoro, Diego Pellegrina, Leonardo Comin, Matteo Vandin, Fabio SPRISS: approximating frequent k-mers by sampling reads, and applications
title	SPRISS: approximating frequent k-mers by sampling reads, and applications
title_full	SPRISS: approximating frequent k-mers by sampling reads, and applications
title_fullStr	SPRISS: approximating frequent k-mers by sampling reads, and applications
title_full_unstemmed	SPRISS: approximating frequent k-mers by sampling reads, and applications
title_short	SPRISS: approximating frequent k-mers by sampling reads, and applications
title_sort	spriss: approximating frequent k-mers by sampling reads, and applications
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9237683/ https://www.ncbi.nlm.nih.gov/pubmed/35583271 http://dx.doi.org/10.1093/bioinformatics/btac180
work_keys_str_mv	AT santorodiego sprissapproximatingfrequentkmersbysamplingreadsandapplications AT pellegrinaleonardo sprissapproximatingfrequentkmersbysamplingreadsandapplications AT cominmatteo sprissapproximatingfrequentkmersbysamplingreadsandapplications AT vandinfabio sprissapproximatingfrequentkmersbysamplingreadsandapplications
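
Illustrative note: the abstract in this record describes sampling reads and then running any k-mer counting algorithm on the sample. The sketch below is a simplified illustration of that general idea only, not the SPRISS implementation. The function names, the uniform sampling with replacement, the choice of `sample_size`, and the definition of frequency as a k-mer's share of all k-mer occurrences are assumptions made for this example, and the sketch carries none of the accuracy guarantees the paper reports.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across the given reads (exact counting)."""
    counts = Counter()
    for read in reads:
        # A read shorter than k yields an empty range and contributes nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def approx_frequent_kmers(reads, k, theta, sample_size, seed=0):
    """Estimate k-mers whose relative frequency is at least theta.

    Hypothetical helper: draws `sample_size` reads uniformly at random
    (with replacement), counts k-mers in the sample only, and returns
    {kmer: estimated_frequency} for k-mers at or above the threshold.
    """
    if not reads:
        return {}
    rng = random.Random(seed)
    sample = [rng.choice(reads) for _ in range(sample_size)]
    counts = count_kmers(sample, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        kmer: c / total
        for kmer, c in counts.items()
        if c / total >= theta
    }
```

In practice the exact counter could be swapped for any dedicated k-mer counting tool applied to the sampled reads; how large the sample must be for a given accuracy is the question the paper itself addresses.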

Similar Items