
ntEdit: scalable genome sequence polishing

MOTIVATION: In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We...


Bibliographic Details
Main Authors:	Warren, René L; Coombe, Lauren; Mohamadi, Hamid; Zhang, Jessica; Jaquish, Barry; Isabel, Nathalie; Jones, Steven J M; Bousquet, Jean; Bohlmann, Joerg; Birol, Inanç
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2019
Subjects:	Applications Notes
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6821332/ https://www.ncbi.nlm.nih.gov/pubmed/31095290 http://dx.doi.org/10.1093/bioinformatics/btz400

author	Warren, René L; Coombe, Lauren; Mohamadi, Hamid; Zhang, Jessica; Jaquish, Barry; Isabel, Nathalie; Jones, Steven J M; Bousquet, Jean; Bohlmann, Joerg; Birol, Inanç
author_sort	Warren, René L
collection	PubMed
description	MOTIVATION: In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. RESULTS: We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 m, on average, on E.coli and C.elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30–40 m on those sequences. We show how ntEdit ran in <2 h 20 m to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. AVAILABILITY AND IMPLEMENTATION: https://github.com/bcgsc/ntedit SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-6821332
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-6821332 2019-11-04 Bioinformatics Applications Notes Oxford University Press 2019-11-01 2019-05-16 /pmc/articles/PMC6821332/ /pubmed/31095290 http://dx.doi.org/10.1093/bioinformatics/btz400 Text en © The Author(s) 2019. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	ntEdit: scalable genome sequence polishing
topic	Applications Notes
