In silico read normalization using set multi-cover optimization

MOTIVATION: De Bruijn graphs are a common assembly data structure for sequencing datasets. But with the advances in sequencing technologies, assembling high coverage datasets has become a computational challenge. Read normalization, which removes redundancy in datasets, is widely applied to reduce r...

Bibliographic Details
Main Authors:	Durai, Dilip A; Schulz, Marcel H
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2018
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6157080/ https://www.ncbi.nlm.nih.gov/pubmed/29912280 http://dx.doi.org/10.1093/bioinformatics/bty307

author	Durai, Dilip A; Schulz, Marcel H
collection	PubMed
description	MOTIVATION: De Bruijn graphs are a common assembly data structure for sequencing datasets. But with the advances in sequencing technologies, assembling high-coverage datasets has become a computational challenge. Read normalization, which removes redundancy in datasets, is widely applied to reduce resource requirements. Current normalization algorithms, though efficient, provide no guarantee of preserving important k-mers that form connections between regions in the graph. RESULTS: Here, normalization is phrased as a set multi-cover problem on reads, and a heuristic algorithm, Optimized Read Normalization Algorithm (ORNA), is proposed. ORNA normalizes to the minimum number of reads required to retain all k-mers and their relative k-mer abundances from the original dataset. Hence, all connections from the original graph are preserved. ORNA was tested on various RNA-seq datasets with different coverage values. It was compared to the current normalization algorithms and was found to perform better. Normalizing error-corrected data allows for more accurate assemblies compared to the normalized uncorrected dataset. Further, an application is proposed in which multiple datasets are combined and normalized to predict novel transcripts that would have been missed otherwise. Finally, ORNA is a general-purpose normalization algorithm that is fast and significantly reduces datasets with a loss of assembly quality between 1% and 30%, depending on reduction stringency. AVAILABILITY AND IMPLEMENTATION: ORNA is available at https://github.com/SchulzLab/ORNA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-6157080
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-6157080 2018-10-01 In silico read normalization using set multi-cover optimization Durai, Dilip A Schulz, Marcel H Bioinformatics Original Papers Oxford University Press 2018-10-01 2018-04-18 /pmc/articles/PMC6157080/ /pubmed/29912280 http://dx.doi.org/10.1093/bioinformatics/bty307 Text en © The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	In silico read normalization using set multi-cover optimization
topic	Original Papers
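
The abstract describes read normalization as a set multi-cover problem: every k-mer in the dataset must stay covered by the retained reads, and more abundant k-mers keep proportionally more coverage. The following is a minimal sketch of that idea, not the ORNA implementation. The function names, the `base` parameter, and the log-scaled coverage target are assumptions made for illustration; the record itself does not specify how targets are chosen.

```python
import math
from collections import Counter


def kmers(read, k):
    """Return every length-k substring of a read (empty if the read is shorter than k)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize_reads(reads, k, base=2.0):
    """Greedy single-pass sketch of set multi-cover read normalization.

    Each k-mer must be covered a target number of times by the kept reads.
    The log-scaled target used here is an assumption for illustration only.
    """
    abundance = Counter(km for read in reads for km in kmers(read, k))
    # Every observed k-mer gets a target of at least 1, so none is lost.
    target = {
        km: max(1, math.ceil(math.log(n, base)))
        for km, n in abundance.items()
    }

    covered = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        # Keep the read only if it still helps some k-mer reach its target.
        if any(covered[km] < target[km] for km in read_kmers):
            kept.append(read)
            covered.update(read_kmers)
    return kept


if __name__ == "__main__":
    # Hypothetical toy input: one highly redundant read and one unique read.
    reads = ["ACGTAC"] * 10 + ["TTGCA"]
    kept = normalize_reads(reads, k=3)
    print(len(reads), "reads reduced to", len(kept))
```

Because a read is kept whenever any of its k-mers is still below target, the first read containing a given k-mer is always retained, so the reduced set contains every k-mer of the input. A greedy pass like this does not guarantee a minimum-size read set; it only illustrates the covering constraint.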