NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly

BACKGROUND: Deep shotgun sequencing on next generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. However, deep coverage variations in short-read data sets and high seq...

Bibliographic Details
Main authors: McCorrison, Jamison M; Venepally, Pratap; Singh, Indresh; Fouts, Derrick E; Lasken, Roger S; Methé, Barbara A
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4245761/
https://www.ncbi.nlm.nih.gov/pubmed/25407910
http://dx.doi.org/10.1186/s12859-014-0357-3
author McCorrison, Jamison M
Venepally, Pratap
Singh, Indresh
Fouts, Derrick E
Lasken, Roger S
Methé, Barbara A
collection PubMed
description BACKGROUND: Deep shotgun sequencing on next generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. However, deep coverage variations in short-read data sets and high sequencing error rates of modern sequencers present new computational challenges in data interpretation, including mapping and de novo assembly. New lab techniques such as multiple displacement amplification (MDA) of single cells and sequence independent single primer amplification (SISPA) allow for sequencing of organisms that cannot be cultured, but generate highly variable coverage due to amplification biases. RESULTS: Here we introduce NeatFreq, a software tool that reduces a data set to more uniform coverage by clustering and selecting from reads binned by their median kmer frequency (RMKF) and uniqueness. Previous algorithms normalize read coverage based on RMKF, but do not include methods for the preferred selection of (1) extremely low coverage regions produced by extremely variable sequencing of random-primed products and (2) 2-sided paired-end sequences. The algorithm increases the incorporation of the most unique, lowest coverage, segments of a genome using an error-corrected data set. NeatFreq was applied to bacterial, viral plaque, and single-cell sequencing data. The algorithm showed an increase in the rate at which the most unique reads in a genome were included in the assembled consensus while also reducing the count of duplicative and erroneous contigs (strings of high confidence overlaps) in the deliverable consensus. The results obtained from conventional Overlap-Layout-Consensus (OLC) were compared to simulated multi-de Bruijn graph assembly alternatives trained for variable coverage input using sequence before and after normalization of coverage. 
Coverage reduction was shown to increase processing speed and reduce memory requirements when using conventional bacterial assembly algorithms. CONCLUSIONS: The normalization of deep coverage spikes, which would otherwise inhibit consensus resolution, enables High Throughput Sequencing (HTS) assembly projects to consistently run to completion with existing assembly software. The NeatFreq software package is free, open source and available at https://github.com/bioh4x/NeatFreq. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-014-0357-3) contains supplementary material, which is available to authorized users.
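The binning idea in the abstract (grouping reads by their median k-mer frequency, then selecting from each bin so that coverage becomes more uniform) can be illustrated with a small sketch. This is not the NeatFreq implementation: the function names, the k-mer length `k`, the `target` frequency, and the proportional sampling rule are all assumptions chosen for illustration, and the sketch ignores error correction, read pairing, and the uniqueness-based selection the abstract mentions.

```python
import random
from collections import Counter, defaultdict
from statistics import median


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def read_median_kmer_frequency(reads, k):
    """Return the median k-mer frequency of each read, in input order.

    Frequencies are counted across the whole read set. Reads shorter
    than k contain no k-mers and get a frequency of 0 (an assumption).
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    result = []
    for read in reads:
        freqs = [counts[km] for km in kmers(read, k)]
        result.append(median(freqs) if freqs else 0)
    return result


def normalize(reads, k=4, target=2, seed=0):
    """Downsample high-frequency bins; keep low-frequency bins whole.

    Hypothetical rule: a bin with median frequency f > target keeps
    roughly the fraction target / f of its reads (at least one read).
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for idx, freq in enumerate(read_median_kmer_frequency(reads, k)):
        bins[int(freq)].append(idx)

    kept = []
    for freq, members in bins.items():
        if freq <= target:
            kept.extend(members)  # low coverage: keep every read
        else:
            n_keep = max(1, round(len(members) * target / freq))
            kept.extend(rng.sample(members, n_keep))
    return [reads[i] for i in sorted(kept)]


if __name__ == "__main__":
    # Ten copies of a deeply covered read plus one unique read.
    reads = ["ACGTACGT"] * 10 + ["TTGGCCAA"]
    print(normalize(reads, k=4, target=2))
```

In this toy input the unique read is always retained while the over-represented read is sampled down, which is the qualitative behavior the abstract describes for low-coverage regions versus coverage spikes.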
format Online
Article
Text
id pubmed-4245761
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4245761 2014-11-28 NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly McCorrison, Jamison M Venepally, Pratap Singh, Indresh Fouts, Derrick E Lasken, Roger S Methé, Barbara A BMC Bioinformatics Software BioMed Central 2014-11-19 /pmc/articles/PMC4245761/ /pubmed/25407910 http://dx.doi.org/10.1186/s12859-014-0357-3 Text en © McCorrison et al.; licensee BioMed Central Ltd. 2014 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly
topic Software