
Unbiased pangenome graphs

MOTIVATION: Pangenome variation graphs model the mutual alignment of collections of DNA sequences. A set of pairwise alignments implies a variation graph, but there are no scalable methods to generate such a graph from these alignments. Existing related approaches depend on a single reference, a spe...


Bibliographic Details
Main authors:	Garrison, Erik, Guarracino, Andrea
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9805579/ https://www.ncbi.nlm.nih.gov/pubmed/36448683 http://dx.doi.org/10.1093/bioinformatics/btac743

_version_	1784862357908881408
author	Garrison, Erik Guarracino, Andrea
author_facet	Garrison, Erik Guarracino, Andrea
author_sort	Garrison, Erik
collection	PubMed
description	MOTIVATION: Pangenome variation graphs model the mutual alignment of collections of DNA sequences. A set of pairwise alignments implies a variation graph, but there are no scalable methods to generate such a graph from these alignments. Existing related approaches depend on a single reference, a specific ordering of genomes or a de Bruijn model based on a fixed k-mer length. A scalable, self-contained method to build pangenome graphs without such limitations would be a key step in pangenome construction and manipulation pipelines. RESULTS: We design the seqwish algorithm, which builds a variation graph from a set of sequences and alignments between them. We first transform the alignment set into an implicit interval tree. To build up the variation graph, we query this tree-based representation of the alignments to reduce transitive matches into single DNA segments in a sequence graph. By recording the mapping from input sequence to output graph, we can trace the original paths through this graph, yielding a pangenome variation graph. We present an implementation that operates in external memory, using disk-backed data structures and lock-free parallel methods to drive the core graph induction step. We demonstrate that our method scales to very large graph induction problems by applying it to build pangenome graphs for several species. AVAILABILITY AND IMPLEMENTATION: seqwish is published as free software under the MIT open source license. Source code and documentation are available at https://github.com/ekg/seqwish. seqwish can be installed via Bioconda https://bioconda.github.io/recipes/seqwish/README.html or GNU Guix https://github.com/ekg/guix-genomics/blob/master/seqwish.scm.
format	Online Article Text
id	pubmed-9805579
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9805579 2023-01-03 Unbiased pangenome graphs Garrison, Erik Guarracino, Andrea Bioinformatics Original Paper Oxford University Press 2022-11-30 /pmc/articles/PMC9805579/ /pubmed/36448683 http://dx.doi.org/10.1093/bioinformatics/btac743 Text en © The Author(s) 2022. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Paper Garrison, Erik Guarracino, Andrea Unbiased pangenome graphs
title	Unbiased pangenome graphs
title_sort	unbiased pangenome graphs
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9805579/ https://www.ncbi.nlm.nih.gov/pubmed/36448683 http://dx.doi.org/10.1093/bioinformatics/btac743
work_keys_str_mv	AT garrisonerik unbiasedpangenomegraphs AT guarracinoandrea unbiasedpangenomegraphs
