Better quality score compression through sequence-based quality smoothing

MOTIVATION: Current NGS techniques are becoming exponentially cheaper. As a result, there is an exponential growth of genomic data unfortunately not followed by an exponential growth of storage, leading to the necessity of compression. Most of the entropy of NGS data lies in the quality values assoc...

Bibliographic Details
Main authors:	Shibuya, Yoshihiro; Comin, Matteo
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6873394/ https://www.ncbi.nlm.nih.gov/pubmed/31757199 http://dx.doi.org/10.1186/s12859-019-2883-5

author	Shibuya, Yoshihiro; Comin, Matteo
collection	PubMed
description	MOTIVATION: Current NGS techniques are becoming exponentially cheaper. As a result, genomic data is growing exponentially, but storage capacity is not growing at the same rate, which makes compression necessary. Most of the entropy of NGS data lies in the quality values associated with each read. Those values are often more diversified than necessary. For this reason, many tools, such as Quartz or GeneCodeq, try to change (smooth) quality scores in order to improve compressibility without altering the important information they carry for downstream analyses such as SNP calling. RESULTS: We use the FM-Index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers, together with an effective smoothing algorithm that maintains high precision for SNP calling pipelines while reducing quality score entropy. We present YALFF (Yet Another Lossy Fastq Filter), a tool for quality score compression by smoothing, leading to improved compressibility of FASTQ files. The succinct k-mer dictionary allows YALFF to run on consumer computers with only 5.7 GB of available free RAM. YALFF's smoothing algorithm can improve genotyping accuracy while using fewer resources. AVAILABILITY: https://github.com/yhhshb/yalff ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-2883-5) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6873394
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6873394 2019-12-12 Better quality score compression through sequence-based quality smoothing Shibuya, Yoshihiro Comin, Matteo BMC Bioinformatics Research
BioMed Central 2019-11-22 /pmc/articles/PMC6873394/ /pubmed/31757199 http://dx.doi.org/10.1186/s12859-019-2883-5 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Better quality score compression through sequence-based quality smoothing
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6873394/ https://www.ncbi.nlm.nih.gov/pubmed/31757199 http://dx.doi.org/10.1186/s12859-019-2883-5
