Cargando…

Efficient haplotype matching between a query and a panel for genealogical search

MOTIVATION: With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of h...

Descripción completa

Detalles Bibliográficos
Autores principales: Naseri, Ardalan, Holzhauser, Erwin, Zhi, Degui, Zhang, Shaojie
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2019
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6612857/
https://www.ncbi.nlm.nih.gov/pubmed/31510689
http://dx.doi.org/10.1093/bioinformatics/btz347
_version_ 1783432952933253120
author Naseri, Ardalan
Holzhauser, Erwin
Zhi, Degui
Zhang, Shaojie
author_facet Naseri, Ardalan
Holzhauser, Erwin
Zhi, Degui
Zhang, Shaojie
author_sort Naseri, Ardalan
collection PubMed
description MOTIVATION: With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a pre-computed index of the panel that enables constant time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin’s Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple ‘good enough’ matches are desired. RESULTS: In this work, we developed two algorithmic extensions of Durbin’s Algorithm 5, that can find all L-long matches, matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce ‘virtual insertion’ of the query into the PBWT matrix of the panel, and then scanning up and down for the PBWT match blocks with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using the simulated data and the UK Biobank data. Our results show that our proposed algorithms can detect related individuals for a given query efficiently in very large cohorts which enables a fast on-line query search. AVAILABILITY AND IMPLEMENTATION: genome.ucf.edu/pbwt-query SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-6612857
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-66128572019-07-12 Efficient haplotype matching between a query and a panel for genealogical search Naseri, Ardalan Holzhauser, Erwin Zhi, Degui Zhang, Shaojie Bioinformatics Ismb/Eccb 2019 Conference Proceedings MOTIVATION: With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a pre-computed index of the panel that enables constant time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin’s Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple ‘good enough’ matches are desired. RESULTS: In this work, we developed two algorithmic extensions of Durbin’s Algorithm 5, that can find all L-long matches, matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce ‘virtual insertion’ of the query into the PBWT matrix of the panel, and then scanning up and down for the PBWT match blocks with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using the simulated data and the UK Biobank data. Our results show that our proposed algorithms can detect related individuals for a given query efficiently in very large cohorts which enables a fast on-line query search. AVAILABILITY AND IMPLEMENTATION: genome.ucf.edu/pbwt-query SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2019-07 2019-07-05 /pmc/articles/PMC6612857/ /pubmed/31510689 http://dx.doi.org/10.1093/bioinformatics/btz347 Text en © The Author(s) 2019. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Ismb/Eccb 2019 Conference Proceedings
Naseri, Ardalan
Holzhauser, Erwin
Zhi, Degui
Zhang, Shaojie
Efficient haplotype matching between a query and a panel for genealogical search
title Efficient haplotype matching between a query and a panel for genealogical search
title_full Efficient haplotype matching between a query and a panel for genealogical search
title_fullStr Efficient haplotype matching between a query and a panel for genealogical search
title_full_unstemmed Efficient haplotype matching between a query and a panel for genealogical search
title_short Efficient haplotype matching between a query and a panel for genealogical search
title_sort efficient haplotype matching between a query and a panel for genealogical search
topic Ismb/Eccb 2019 Conference Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6612857/
https://www.ncbi.nlm.nih.gov/pubmed/31510689
http://dx.doi.org/10.1093/bioinformatics/btz347
work_keys_str_mv AT naseriardalan efficienthaplotypematchingbetweenaqueryandapanelforgenealogicalsearch
AT holzhausererwin efficienthaplotypematchingbetweenaqueryandapanelforgenealogicalsearch
AT zhidegui efficienthaplotypematchingbetweenaqueryandapanelforgenealogicalsearch
AT zhangshaojie efficienthaplotypematchingbetweenaqueryandapanelforgenealogicalsearch