Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2

BACKGROUND: Bisulfite treatment of DNA followed by sequencing (BS-seq) has become a standard technique in epigenetic studies, providing researchers with tools for generating single-base resolution maps of whole methylomes. Aligning bisulfite-treated reads, however, is a computationally difficult tas...

Bibliographic Details
Main authors: Prezza, Nicola, Vezzi, Francesco, Käller, Max, Policriti, Alberto
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4896272/
https://www.ncbi.nlm.nih.gov/pubmed/26961371
http://dx.doi.org/10.1186/s12859-016-0910-3
_version_ 1782436001782693888
author Prezza, Nicola
Vezzi, Francesco
Käller, Max
Policriti, Alberto
author_facet Prezza, Nicola
Vezzi, Francesco
Käller, Max
Policriti, Alberto
author_sort Prezza, Nicola
collection PubMed
description BACKGROUND: Bisulfite treatment of DNA followed by sequencing (BS-seq) has become a standard technique in epigenetic studies, providing researchers with tools for generating single-base resolution maps of whole methylomes. Aligning bisulfite-treated reads, however, is a computationally difficult task: bisulfite treatment decreases the (lexical) complexity of low-methylated genomic regions, and C-to-T mismatches may reflect cytosine unmethylation rather than SNPs or sequencing errors. Further challenges arise both during and after the alignment phase: data structures used by the aligner should be fast and should fit into main memory, and the methylation-caller output should be somehow compressed, due to its significant size. METHODS: As far as data structures employed to align bisulfite-treated reads are concerned, solutions proposed in the literature can be roughly grouped into two main categories: those storing pointers at each text position (e.g. hash tables, suffix trees/arrays), and those using the information-theoretic minimum number of bits (e.g. FM indexes and compressed suffix arrays). The former are fast and memory consuming. The latter are much slower and light. In this paper, we try to close this gap proposing a data structure for aligning bisulfite-treated reads which is at the same time fast, light, and very accurate. We reach this objective by combining a recent theoretical result on succinct hashing with a bisulfite-aware hash function. Furthermore, the new versions of the tools implementing our ideas (the aligner ERNE-BS5 2 and the caller ERNE-METH 2) have been extended with increased downstream compatibility (EPP/Bismark cov output formats), output compression, and support for target enrichment protocols.
RESULTS: Experimental results on public and simulated WGBS libraries show that our algorithmic solution is a competitive tradeoff between hash-based and BWT-based indexes, being as fast and accurate as the former, and as memory-efficient as the latter. CONCLUSIONS: The new functionalities of our bisulfite aligner and caller make it a fast and memory efficient tool, useful to analyze big datasets with little computational resources, to easily process target enrichment data, and produce statistics such as protocol efficiency and coverage as a function of the distance from target regions. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0910-3) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4896272
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4896272 2016-06-10 Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2 Prezza, Nicola Vezzi, Francesco Käller, Max Policriti, Alberto BMC Bioinformatics Research Article BACKGROUND: Bisulfite treatment of DNA followed by sequencing (BS-seq) has become a standard technique in epigenetic studies, providing researchers with tools for generating single-base resolution maps of whole methylomes. Aligning bisulfite-treated reads, however, is a computationally difficult task: bisulfite treatment decreases the (lexical) complexity of low-methylated genomic regions, and C-to-T mismatches may reflect cytosine unmethylation rather than SNPs or sequencing errors. Further challenges arise both during and after the alignment phase: data structures used by the aligner should be fast and should fit into main memory, and the methylation-caller output should be somehow compressed, due to its significant size. METHODS: As far as data structures employed to align bisulfite-treated reads are concerned, solutions proposed in the literature can be roughly grouped into two main categories: those storing pointers at each text position (e.g. hash tables, suffix trees/arrays), and those using the information-theoretic minimum number of bits (e.g. FM indexes and compressed suffix arrays). The former are fast and memory consuming. The latter are much slower and light. In this paper, we try to close this gap proposing a data structure for aligning bisulfite-treated reads which is at the same time fast, light, and very accurate. We reach this objective by combining a recent theoretical result on succinct hashing with a bisulfite-aware hash function. Furthermore, the new versions of the tools implementing our ideas (the aligner ERNE-BS5 2 and the caller ERNE-METH 2) have been extended with increased downstream compatibility (EPP/Bismark cov output formats), output compression, and support for target enrichment protocols.
RESULTS: Experimental results on public and simulated WGBS libraries show that our algorithmic solution is a competitive tradeoff between hash-based and BWT-based indexes, being as fast and accurate as the former, and as memory-efficient as the latter. CONCLUSIONS: The new functionalities of our bisulfite aligner and caller make it a fast and memory efficient tool, useful to analyze big datasets with little computational resources, to easily process target enrichment data, and produce statistics such as protocol efficiency and coverage as a function of the distance from target regions. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0910-3) contains supplementary material, which is available to authorized users. BioMed Central 2016-03-02 /pmc/articles/PMC4896272/ /pubmed/26961371 http://dx.doi.org/10.1186/s12859-016-0910-3 Text en © Prezza et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Prezza, Nicola
Vezzi, Francesco
Käller, Max
Policriti, Alberto
Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2
title Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2
title_full Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2
title_fullStr Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2
title_full_unstemmed Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2
title_short Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2
title_sort fast, accurate, and lightweight analysis of bs-treated reads with erne 2
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4896272/
https://www.ncbi.nlm.nih.gov/pubmed/26961371
http://dx.doi.org/10.1186/s12859-016-0910-3
work_keys_str_mv AT prezzanicola fastaccurateandlightweightanalysisofbstreatedreadswitherne2
AT vezzifrancesco fastaccurateandlightweightanalysisofbstreatedreadswitherne2
AT kallermax fastaccurateandlightweightanalysisofbstreatedreadswitherne2
AT policritialberto fastaccurateandlightweightanalysisofbstreatedreadswitherne2