Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model

The Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel (haplotype threading). For small panels the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics, and has been...

Bibliographic Details
Main authors: Sanaullah, Ahsan; Zhi, Degui; Zhang, Shaojie
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9881975/
https://www.ncbi.nlm.nih.gov/pubmed/36711469
http://dx.doi.org/10.1101/2023.01.04.522803
author Sanaullah, Ahsan
Zhi, Degui
Zhang, Shaojie
collection PubMed
description The Li & Stephens (LS) hidden Markov model (HMM) models the reconstruction of a haplotype as a mosaic copy of haplotypes in a reference panel (haplotype threading). For small panels, the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics, and it has been the foundational model for haplotype phasing and imputation. However, LS becomes inefficient when the sample size is large (tens of thousands to millions) because its time complexity is linear in the panel size (O(MN), where M is the number of haplotypes and N is the number of sites in the panel). Recently the PBWT, an efficient data structure capturing local matches among haplotypes, was proposed to offer fast methods for finding an optimal (Viterbi) solution to the LS HMM. But the solution space of LS for large panels remains elusive. Previously we introduced the Minimal Positional Substring Cover (MPSC) problem as an alternative formulation of LS whose objective is to cover a query haplotype with a minimum number of segments from haplotypes in a reference panel. The MPSC formulation allows a haplotype threading to be generated in time independent of sample size (O(N)). This allows haplotype threading on very large biobank-scale panels on which the LS model is infeasible. Here we present new results on the solution space of the MPSC by first identifying the property that any MPSC has a set of required regions, and then proposing an MPSC graph. In addition, we derived a number of optimal algorithms for MPSC, including solution enumeration, the Length Maximal MPSC, and h-MPSC solutions. In doing so, our algorithms reveal the solution space of LS for large panels. Even though we only solved an extreme case of LS where the emission probability is 0, our algorithms can be made more robust by PBWT smoothing. We show that our method is informative in revealing the characteristics of biobank-scale data sets and can improve genotype imputation.
format Online
Article
Text
id pubmed-9881975
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-9881975 2023-01-28 bioRxiv Article Cold Spring Harbor Laboratory 2023-01-06 /pmc/articles/PMC9881975/ /pubmed/36711469 http://dx.doi.org/10.1101/2023.01.04.522803 Text en This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
title Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9881975/
https://www.ncbi.nlm.nih.gov/pubmed/36711469
http://dx.doi.org/10.1101/2023.01.04.522803