Cargando…

Efficient haplotype matching between a query and a panel for genealogical search

MOTIVATION: With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of h...

Descripción completa

Detalles Bibliográficos
Autores principales:	Naseri, Ardalan, Holzhauser, Erwin, Zhi, Degui, Zhang, Shaojie
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2019
Materias:	Ismb/Eccb 2019 Conference Proceedings
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6612857/ https://www.ncbi.nlm.nih.gov/pubmed/31510689 http://dx.doi.org/10.1093/bioinformatics/btz347

_version_	1783432952933253120
author	Naseri, Ardalan Holzhauser, Erwin Zhi, Degui Zhang, Shaojie
author_facet	Naseri, Ardalan Holzhauser, Erwin Zhi, Degui Zhang, Shaojie
author_sort	Naseri, Ardalan
collection	PubMed
description	MOTIVATION: With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a pre-computed index of the panel that enables constant time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin’s Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple ‘good enough’ matches are desired. RESULTS: In this work, we developed two algorithmic extensions of Durbin’s Algorithm 5, that can find all L-long matches, matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce ‘virtual insertion’ of the query into the PBWT matrix of the panel, and then scanning up and down for the PBWT match blocks with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using the simulated data and the UK Biobank data. Our results show that our proposed algorithms can detect related individuals for a given query efficiently in very large cohorts which enables a fast on-line query search. AVAILABILITY AND IMPLEMENTATION: genome.ucf.edu/pbwt-query SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-6612857
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-66128572019-07-12 Efficient haplotype matching between a query and a panel for genealogical search Naseri, Ardalan Holzhauser, Erwin Zhi, Degui Zhang, Shaojie Bioinformatics Ismb/Eccb 2019 Conference Proceedings MOTIVATION: With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a pre-computed index of the panel that enables constant time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin’s Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple ‘good enough’ matches are desired. RESULTS: In this work, we developed two algorithmic extensions of Durbin’s Algorithm 5, that can find all L-long matches, matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce ‘virtual insertion’ of the query into the PBWT matrix of the panel, and then scanning up and down for the PBWT match blocks with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using the simulated data and the UK Biobank data. Our results show that our proposed algorithms can detect related individuals for a given query efficiently in very large cohorts which enables a fast on-line query search. AVAILABILITY AND IMPLEMENTATION: genome.ucf.edu/pbwt-query SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2019-07 2019-07-05 /pmc/articles/PMC6612857/ /pubmed/31510689 http://dx.doi.org/10.1093/bioinformatics/btz347 Text en © The Author(s) 2019. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle	Ismb/Eccb 2019 Conference Proceedings Naseri, Ardalan Holzhauser, Erwin Zhi, Degui Zhang, Shaojie Efficient haplotype matching between a query and a panel for genealogical search
title	Efficient haplotype matching between a query and a panel for genealogical search
title_full	Efficient haplotype matching between a query and a panel for genealogical search
title_fullStr	Efficient haplotype matching between a query and a panel for genealogical search
title_full_unstemmed	Efficient haplotype matching between a query and a panel for genealogical search
title_short	Efficient haplotype matching between a query and a panel for genealogical search
title_sort	efficient haplotype matching between a query and a panel for genealogical search
topic	Ismb/Eccb 2019 Conference Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6612857/ https://www.ncbi.nlm.nih.gov/pubmed/31510689 http://dx.doi.org/10.1093/bioinformatics/btz347
work_keys_str_mv	AT naseriardalan efficienthaplotypematchingbetweenaqueryandapanelforgenealogicalsearch AT holzhausererwin efficienthaplotypematchingbetweenaqueryandapanelforgenealogicalsearch AT zhidegui efficienthaplotypematchingbetweenaqueryandapanelforgenealogicalsearch AT zhangshaojie efficienthaplotypematchingbetweenaqueryandapanelforgenealogicalsearch

Efficient haplotype matching between a query and a panel for genealogical search

Ejemplares similares