Coverage-preserving sparsification of overlap graphs for long-read assembly
MOTIVATION: Read-overlap-based graph data structures play a central role in computing de novo genome assembly. Most long-read assemblers use Myers’s string graph model to sparsify overlap graphs. Graph sparsification improves assembly contiguity by removing spurious and redundant connections. Howeve...
Main author: | Jain, Chirag |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2023 |
Subjects: | Original Paper |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10132763/ https://www.ncbi.nlm.nih.gov/pubmed/36892439 http://dx.doi.org/10.1093/bioinformatics/btad124 |
_version_ | 1785031459100164096 |
---|---|
author | Jain, Chirag |
collection | PubMed |
description | MOTIVATION: Read-overlap-based graph data structures play a central role in computing de novo genome assembly. Most long-read assemblers use Myers’s string graph model to sparsify overlap graphs. Graph sparsification improves assembly contiguity by removing spurious and redundant connections. However, a graph model must be coverage-preserving, i.e. it must ensure that there exist walks in the graph that spell all chromosomes, given sufficient sequencing coverage. This property becomes even more important for diploid genomes, polyploid genomes, and metagenomes where there is a risk of losing haplotype-specific information. RESULTS: We develop a novel theoretical framework under which the coverage-preserving properties of a graph model can be analyzed. We first prove that de Bruijn graph and overlap graph models are guaranteed to be coverage-preserving. We next show that the standard string graph model lacks this guarantee. The latter result is consistent with prior work suggesting that removal of contained reads, i.e. the reads that are substrings of other reads, can lead to coverage gaps during string graph construction. Our experiments done using simulated long reads from HG002 human diploid genome show that 50 coverage gaps are introduced on average by ignoring contained reads from nanopore datasets. To remedy this, we propose practical heuristics that are well-supported by our theoretical results and are useful to decide which contained reads should be retained to avoid coverage gaps. Our method retains a small fraction of contained reads (1–2%) and closes majority of the coverage gaps. AVAILABILITY AND IMPLEMENTATION: Source code is available through GitHub (https://github.com/at-cg/ContainX) and Zenodo with doi: 10.5281/zenodo.7687543. |
format | Online Article Text |
id | pubmed-10132763 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-101327632023-04-27 Coverage-preserving sparsification of overlap graphs for long-read assembly Jain, Chirag Bioinformatics Original Paper MOTIVATION: Read-overlap-based graph data structures play a central role in computing de novo genome assembly. Most long-read assemblers use Myers’s string graph model to sparsify overlap graphs. Graph sparsification improves assembly contiguity by removing spurious and redundant connections. However, a graph model must be coverage-preserving, i.e. it must ensure that there exist walks in the graph that spell all chromosomes, given sufficient sequencing coverage. This property becomes even more important for diploid genomes, polyploid genomes, and metagenomes where there is a risk of losing haplotype-specific information. RESULTS: We develop a novel theoretical framework under which the coverage-preserving properties of a graph model can be analyzed. We first prove that de Bruijn graph and overlap graph models are guaranteed to be coverage-preserving. We next show that the standard string graph model lacks this guarantee. The latter result is consistent with prior work suggesting that removal of contained reads, i.e. the reads that are substrings of other reads, can lead to coverage gaps during string graph construction. Our experiments done using simulated long reads from HG002 human diploid genome show that 50 coverage gaps are introduced on average by ignoring contained reads from nanopore datasets. To remedy this, we propose practical heuristics that are well-supported by our theoretical results and are useful to decide which contained reads should be retained to avoid coverage gaps. Our method retains a small fraction of contained reads (1–2%) and closes majority of the coverage gaps. AVAILABILITY AND IMPLEMENTATION: Source code is available through GitHub (https://github.com/at-cg/ContainX) and Zenodo with doi: 10.5281/zenodo.7687543. 
Oxford University Press 2023-03-09 /pmc/articles/PMC10132763/ /pubmed/36892439 http://dx.doi.org/10.1093/bioinformatics/btad124 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Paper Jain, Chirag Coverage-preserving sparsification of overlap graphs for long-read assembly |
title | Coverage-preserving sparsification of overlap graphs for long-read assembly |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10132763/ https://www.ncbi.nlm.nih.gov/pubmed/36892439 http://dx.doi.org/10.1093/bioinformatics/btad124 |
work_keys_str_mv | AT jainchirag coveragepreservingsparsificationofoverlapgraphsforlongreadassembly |
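
The description above defines contained reads as reads that are substrings of other reads. A minimal sketch of that definition follows, assuming error-free reads on a single strand (no reverse complements) and exact string matching. The function name `contained_reads` is hypothetical and is not taken from ContainX; the paper's heuristics for deciding which contained reads to retain are not reproduced here.

```python
def contained_reads(reads):
    """Return the indices of reads that are contained in another read.

    Assumptions (illustrative only): reads are error-free strings on one
    strand, and containment means exact substring match. Among identical
    reads, only the first occurrence is treated as non-contained.
    """
    contained = set()
    for i, r in enumerate(reads):
        for j, s in enumerate(reads):
            if i == j:
                continue
            # r is a proper substring of a strictly longer read s
            if len(r) < len(s) and r in s:
                contained.add(i)
                break
            # exact duplicate: keep the earlier copy, mark the later one
            if r == s and j < i:
                contained.add(i)
                break
    return contained


if __name__ == "__main__":
    reads = ["ACGTACGT", "GTAC", "TTTT", "ACGT"]
    # Indices of reads that a standard string graph construction would drop
    print(sorted(contained_reads(reads)))
```

This quadratic scan is only meant to make the definition concrete; real assemblers detect containment from pairwise overlap alignments rather than exact substring tests.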