Cargando…

μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data

MOTIVATION: The Positional Burrows–Wheeler Transform ([Formula: see text]) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in [Formula: see text] time. This represents a significant improvemen...

Descripción completa

Detalles Bibliográficos
Autores principales:	Cozzi, Davide, Rossi, Massimiliano, Rubinacci, Simone, Gagie, Travis, Köppl, Dominik, Boucher, Christina, Bonizzoni, Paola
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2023
Materias:	Original Paper
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10502237/ https://www.ncbi.nlm.nih.gov/pubmed/37688560 http://dx.doi.org/10.1093/bioinformatics/btad552

_version_	1785106277153636352
author	Cozzi, Davide Rossi, Massimiliano Rubinacci, Simone Gagie, Travis Köppl, Dominik Boucher, Christina Bonizzoni, Paola
author_facet	Cozzi, Davide Rossi, Massimiliano Rubinacci, Simone Gagie, Travis Köppl, Dominik Boucher, Christina Bonizzoni, Paola
author_sort	Cozzi, Davide
collection	PubMed
description	MOTIVATION: The Positional Burrows–Wheeler Transform ([Formula: see text]) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in [Formula: see text] time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. RESULTS: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as [Formula: see text] , and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the [Formula: see text] reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, [Formula: see text] produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. [Formula: see text] is an adaptation of techniques for the run-length compressed [Formula: see text] for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.
format	Online Article Text
id	pubmed-10502237
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-105022372023-09-16 μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data Cozzi, Davide Rossi, Massimiliano Rubinacci, Simone Gagie, Travis Köppl, Dominik Boucher, Christina Bonizzoni, Paola Bioinformatics Original Paper MOTIVATION: The Positional Burrows–Wheeler Transform ([Formula: see text]) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in [Formula: see text] time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. RESULTS: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as [Formula: see text] , and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the [Formula: see text] reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, [Formula: see text] produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. [Formula: see text] is an adaptation of techniques for the run-length compressed [Formula: see text] for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html. Oxford University Press 2023-09-09 /pmc/articles/PMC10502237/ /pubmed/37688560 http://dx.doi.org/10.1093/bioinformatics/btad552 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Paper Cozzi, Davide Rossi, Massimiliano Rubinacci, Simone Gagie, Travis Köppl, Dominik Boucher, Christina Bonizzoni, Paola μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data
title	μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data
title_full	μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data
title_fullStr	μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data
title_full_unstemmed	μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data
title_short	μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data
title_sort	μ- pbwt: a lightweight r-indexing of the pbwt for storing and querying uk biobank data
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10502237/ https://www.ncbi.nlm.nih.gov/pubmed/37688560 http://dx.doi.org/10.1093/bioinformatics/btad552
work_keys_str_mv	AT cozzidavide mpbwtalightweightrindexingofthepbwtforstoringandqueryingukbiobankdata AT rossimassimiliano mpbwtalightweightrindexingofthepbwtforstoringandqueryingukbiobankdata AT rubinaccisimone mpbwtalightweightrindexingofthepbwtforstoringandqueryingukbiobankdata AT gagietravis mpbwtalightweightrindexingofthepbwtforstoringandqueryingukbiobankdata AT koppldominik mpbwtalightweightrindexingofthepbwtforstoringandqueryingukbiobankdata AT boucherchristina mpbwtalightweightrindexingofthepbwtforstoringandqueryingukbiobankdata AT bonizzonipaola mpbwtalightweightrindexingofthepbwtforstoringandqueryingukbiobankdata

μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data

Ejemplares similares