Evaluating long-read de novo assembly tools for eukaryotic genomes: insights and considerations

BACKGROUND: Assembly algorithm choice should be a deliberate, well-justified decision when researchers create genome assemblies for eukaryotic organisms from third-generation sequencing technologies. While third-generation sequencing by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (Pac...

Bibliographic Details
Main authors:	Cosma, Bianca-Maria; Shirali Hossein Zade, Ramin; Jordan, Erin Noel; van Lent, Paul; Peng, Chengyao; Pillay, Stephanie; Abeel, Thomas
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10673639/ https://www.ncbi.nlm.nih.gov/pubmed/38000912 http://dx.doi.org/10.1093/gigascience/giad100

_version_	1785149625859047424
author	Cosma, Bianca-Maria; Shirali Hossein Zade, Ramin; Jordan, Erin Noel; van Lent, Paul; Peng, Chengyao; Pillay, Stephanie; Abeel, Thomas
author_facet	Cosma, Bianca-Maria; Shirali Hossein Zade, Ramin; Jordan, Erin Noel; van Lent, Paul; Peng, Chengyao; Pillay, Stephanie; Abeel, Thomas
author_sort	Cosma, Bianca-Maria
collection	PubMed
description	BACKGROUND: Assembly algorithm choice should be a deliberate, well-justified decision when researchers create genome assemblies for eukaryotic organisms from third-generation sequencing technologies. While third-generation sequencing by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) has overcome the disadvantages of short read lengths specific to next-generation sequencing (NGS), third-generation sequencers are known to produce more error-prone reads, thereby generating a new set of challenges for assembly algorithms and pipelines. However, the introduction of HiFi reads, which offer substantially reduced error rates, has provided a promising solution for more accurate assembly outcomes. Since the introduction of third-generation sequencing technologies, many tools have been developed that aim to take advantage of the longer reads, and researchers need to choose the correct assembler for their projects. RESULTS: We benchmarked state-of-the-art long-read de novo assemblers to help readers make a balanced choice for the assembly of eukaryotes. To this end, we used 12 real and 64 simulated datasets from different eukaryotic genomes, with different read length distributions, imitating PacBio continuous long-read (CLR), PacBio high-fidelity (HiFi), and ONT sequencing to evaluate the assemblers. We include 5 commonly used long-read assemblers in our benchmark: Canu, Flye, Miniasm, Raven, and wtdbg2 for ONT and PacBio CLR reads. For PacBio HiFi reads, we include 5 state-of-the-art HiFi assemblers: HiCanu, Flye, Hifiasm, LJA, and MBG. Evaluation categories address the following metrics: reference-based metrics, assembly statistics, misassembly count, BUSCO completeness, runtime, and RAM usage. Additionally, we investigated the effect of increased read length on the quality of the assemblies and report that read length can, but does not always, positively impact assembly quality.
CONCLUSIONS: Our benchmark concludes that there is no assembler that performs the best in all the evaluation categories. However, our results show that overall Flye is the best-performing assembler for PacBio CLR and ONT reads, both on real and simulated data. Meanwhile, the best-performing PacBio HiFi assemblers are Hifiasm and LJA. Next, the benchmarking using longer reads shows that the increased read length improves assembly quality, but the extent to which that can be achieved depends on the size and complexity of the reference genome.
format	Online Article Text
id	pubmed-10673639
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10673639 2023-11-24 Evaluating long-read de novo assembly tools for eukaryotic genomes: insights and considerations Cosma, Bianca-Maria; Shirali Hossein Zade, Ramin; Jordan, Erin Noel; van Lent, Paul; Peng, Chengyao; Pillay, Stephanie; Abeel, Thomas Gigascience Research Oxford University Press 2023-11-24 /pmc/articles/PMC10673639/ /pubmed/38000912 http://dx.doi.org/10.1093/gigascience/giad100 Text en © The Author(s) 2023. Published by Oxford University Press GigaScience. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Cosma, Bianca-Maria; Shirali Hossein Zade, Ramin; Jordan, Erin Noel; van Lent, Paul; Peng, Chengyao; Pillay, Stephanie; Abeel, Thomas Evaluating long-read de novo assembly tools for eukaryotic genomes: insights and considerations
title	Evaluating long-read de novo assembly tools for eukaryotic genomes: insights and considerations
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10673639/ https://www.ncbi.nlm.nih.gov/pubmed/38000912 http://dx.doi.org/10.1093/gigascience/giad100
