Cargando…

Syllable-PBWT for space-efficient haplotype long-match query

MOTIVATION: The positional Burrows–Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data. For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of...

Descripción completa

Detalles Bibliográficos
Autores principales:	Wang, Victor, Naseri, Ardalan, Zhang, Shaojie, Zhi, Degui
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2022
Materias:	Original Paper
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9805553/ https://www.ncbi.nlm.nih.gov/pubmed/36440908 http://dx.doi.org/10.1093/bioinformatics/btac734

_version_	1784862351576530944
author	Wang, Victor Naseri, Ardalan Zhang, Shaojie Zhi, Degui
author_facet	Wang, Victor Naseri, Ardalan Zhang, Shaojie Zhi, Degui
author_sort	Wang, Victor
collection	PubMed
description	MOTIVATION: The positional Burrows–Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data. For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of haplotypes. However, to enable fast query searches, the full-sized panel and PBWT data structures must be kept in memory, preventing existing algorithms from scaling up to modern biobank panels consisting of millions of haplotypes. In this work, we propose a space-efficient variation of PBWT named Syllable-PBWT, which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function for positional substring comparison. With the Syllable-PBWT data structures, we then present a long match query algorithm named Syllable-Query. RESULTS: Compared to the most time- and space-efficient previously published solution to the long match query problem, Syllable-Query reduced the memory use by a factor of over 100 on both the UK Biobank genotype data and the 1000 Genomes Project sequence data. Surprisingly, the smaller size of our syllabic data structures allows for more efficient iteration and CPU cache usage, granting Syllable-Query even faster runtime than existing solutions. AVAILABILITY AND IMPLEMENTATION: https://github.com/ZhiGroup/Syllable-PBWT SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-9805553
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-98055532023-01-03 Syllable-PBWT for space-efficient haplotype long-match query Wang, Victor Naseri, Ardalan Zhang, Shaojie Zhi, Degui Bioinformatics Original Paper MOTIVATION: The positional Burrows–Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data. For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of haplotypes. However, to enable fast query searches, the full-sized panel and PBWT data structures must be kept in memory, preventing existing algorithms from scaling up to modern biobank panels consisting of millions of haplotypes. In this work, we propose a space-efficient variation of PBWT named Syllable-PBWT, which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function for positional substring comparison. With the Syllable-PBWT data structures, we then present a long match query algorithm named Syllable-Query. RESULTS: Compared to the most time- and space-efficient previously published solution to the long match query problem, Syllable-Query reduced the memory use by a factor of over 100 on both the UK Biobank genotype data and the 1000 Genomes Project sequence data. Surprisingly, the smaller size of our syllabic data structures allows for more efficient iteration and CPU cache usage, granting Syllable-Query even faster runtime than existing solutions. AVAILABILITY AND IMPLEMENTATION: https://github.com/ZhiGroup/Syllable-PBWT SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2022-11-28 /pmc/articles/PMC9805553/ /pubmed/36440908 http://dx.doi.org/10.1093/bioinformatics/btac734 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Paper Wang, Victor Naseri, Ardalan Zhang, Shaojie Zhi, Degui Syllable-PBWT for space-efficient haplotype long-match query
title	Syllable-PBWT for space-efficient haplotype long-match query
title_full	Syllable-PBWT for space-efficient haplotype long-match query
title_fullStr	Syllable-PBWT for space-efficient haplotype long-match query
title_full_unstemmed	Syllable-PBWT for space-efficient haplotype long-match query
title_short	Syllable-PBWT for space-efficient haplotype long-match query
title_sort	syllable-pbwt for space-efficient haplotype long-match query
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9805553/ https://www.ncbi.nlm.nih.gov/pubmed/36440908 http://dx.doi.org/10.1093/bioinformatics/btac734
work_keys_str_mv	AT wangvictor syllablepbwtforspaceefficienthaplotypelongmatchquery AT naseriardalan syllablepbwtforspaceefficienthaplotypelongmatchquery AT zhangshaojie syllablepbwtforspaceefficienthaplotypelongmatchquery AT zhidegui syllablepbwtforspaceefficienthaplotypelongmatchquery

Syllable-PBWT for space-efficient haplotype long-match query

Ejemplares similares