Fully-sensitive seed finding in sequence graphs using a hybrid index

MOTIVATION: Sequence graphs are versatile data structures that are, for instance, able to represent the genetic variation found in a population and to facilitate genome assembly. Read mapping to sequence graphs constitutes an important step for many applications and is usually done by first finding...

Bibliographic Details
Main authors: Ghaffaari, Ali; Marschall, Tobias
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6612829/
https://www.ncbi.nlm.nih.gov/pubmed/31510650
http://dx.doi.org/10.1093/bioinformatics/btz341
author Ghaffaari, Ali
Marschall, Tobias
collection PubMed
description MOTIVATION: Sequence graphs are versatile data structures that are, for instance, able to represent the genetic variation found in a population and to facilitate genome assembly. Read mapping to sequence graphs constitutes an important step for many applications and is usually done by first finding exact seed matches, which are then extended by alignment. Existing methods for finding seed hits prune the graph in complex regions, leading to a loss of information, especially in highly polymorphic regions of the genome. While such complex graph structures can indeed lead to a combinatorial explosion of possible alleles, the query set of reads from a diploid individual realizes only two alleles per locus, a property that is not exploited by extant methods. RESULTS: We present the Pan-genome Seed Index (PSI), a fully-sensitive hybrid method for seed finding, which takes full advantage of this property by combining an index over selected paths in the graph with an index over the query reads. This enables PSI to find all seeds while eliminating the need to prune the graph. We demonstrate its performance with different parameter settings on both simulated data and on a whole human genome graph constructed from variants in the 1000 Genomes Project dataset. On this graph, PSI outperforms GCSA2 in terms of index size, query time and sensitivity. AVAILABILITY AND IMPLEMENTATION: The C++ implementation is publicly available at: https://github.com/cartoonist/psi.
format Online
Article
Text
id pubmed-6612829
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6612829 2019-07-12 Fully-sensitive seed finding in sequence graphs using a hybrid index Ghaffaari, Ali Marschall, Tobias Bioinformatics Ismb/Eccb 2019 Conference Proceedings Oxford University Press 2019-07 2019-07-05 /pmc/articles/PMC6612829/ /pubmed/31510650 http://dx.doi.org/10.1093/bioinformatics/btz341 Text en © The Author(s) 2019. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Fully-sensitive seed finding in sequence graphs using a hybrid index
topic Ismb/Eccb 2019 Conference Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6612829/
https://www.ncbi.nlm.nih.gov/pubmed/31510650
http://dx.doi.org/10.1093/bioinformatics/btz341