Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model
The Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel (haplotype threading). For small panels the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics, and has been...
Main Authors: | Sanaullah, Ahsan; Zhi, Degui; Zhang, Shaojie |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9881975/ https://www.ncbi.nlm.nih.gov/pubmed/36711469 http://dx.doi.org/10.1101/2023.01.04.522803 |
_version_ | 1784879218149031936 |
---|---|
author | Sanaullah, Ahsan Zhi, Degui Zhang, Shaojie |
author_facet | Sanaullah, Ahsan Zhi, Degui Zhang, Shaojie |
author_sort | Sanaullah, Ahsan |
collection | PubMed |
description | The Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel (haplotype threading). For small panels the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics, and has been the foundational model for haplotype phasing and imputation. However, LS becomes inefficient when sample size is large (tens of thousands to millions), because of its linear time complexity (O(MN), where M is the number of haplotypes and N is the number of sites in the panel). Recently the PBWT, an efficient data structure capturing the local haplotype matching among haplotypes, was proposed to offer fast methods for giving some optimal solution (Viterbi) to the LS HMM. But the solution space of the LS for large panels is still elusive. Previously we introduced the Minimal Positional Substring Cover (MPSC) problem as an alternative formulation of LS whose objective is to cover a query haplotype by a minimum number of segments from haplotypes in a reference panel. The MPSC formulation allows the generation of a haplotype threading in time that is constant with respect to sample size (O(N)). This allows haplotype threading on very large biobank-scale panels on which the LS model is infeasible. Here we present new results on the solution space of the MPSC by first identifying a property that any MPSC will have a set of required regions, and then proposing an MPSC graph. In addition, we derived a number of optimal algorithms for MPSC, including solution enumerations, the Length Maximal MPSC, and h-MPSC solutions. In doing so, our algorithms reveal the solution space of LS for large panels. Even though we only solved an extreme case of LS where the emission probability is 0, our algorithms can be made more robust by PBWT smoothing. We show that our method is informative in terms of revealing the characteristics of biobank-scale data sets and can improve genotype imputation. |
format | Online Article Text |
id | pubmed-9881975 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory |
record_format | MEDLINE/PubMed |
spelling | pubmed-9881975 2023-01-28 Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model Sanaullah, Ahsan Zhi, Degui Zhang, Shaojie bioRxiv Article The Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel (haplotype threading). For small panels the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics, and has been the foundational model for haplotype phasing and imputation. However, LS becomes inefficient when sample size is large (tens of thousands to millions), because of its linear time complexity (O(MN), where M is the number of haplotypes and N is the number of sites in the panel). Recently the PBWT, an efficient data structure capturing the local haplotype matching among haplotypes, was proposed to offer fast methods for giving some optimal solution (Viterbi) to the LS HMM. But the solution space of the LS for large panels is still elusive. Previously we introduced the Minimal Positional Substring Cover (MPSC) problem as an alternative formulation of LS whose objective is to cover a query haplotype by a minimum number of segments from haplotypes in a reference panel. The MPSC formulation allows the generation of a haplotype threading in time that is constant with respect to sample size (O(N)). This allows haplotype threading on very large biobank-scale panels on which the LS model is infeasible. Here we present new results on the solution space of the MPSC by first identifying a property that any MPSC will have a set of required regions, and then proposing an MPSC graph. In addition, we derived a number of optimal algorithms for MPSC, including solution enumerations, the Length Maximal MPSC, and h-MPSC solutions. In doing so, our algorithms reveal the solution space of LS for large panels. Even though we only solved an extreme case of LS where the emission probability is 0, our algorithms can be made more robust by PBWT smoothing. We show that our method is informative in terms of revealing the characteristics of biobank-scale data sets and can improve genotype imputation. Cold Spring Harbor Laboratory 2023-01-06 /pmc/articles/PMC9881975/ /pubmed/36711469 http://dx.doi.org/10.1101/2023.01.04.522803 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator. |
spellingShingle | Article Sanaullah, Ahsan Zhi, Degui Zhang, Shaojie Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model |
title | Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model |
title_full | Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model |
title_fullStr | Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model |
title_full_unstemmed | Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model |
title_short | Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model |
title_sort | minimal positional substring cover: a haplotype threading alternative to li & stephens model |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9881975/ https://www.ncbi.nlm.nih.gov/pubmed/36711469 http://dx.doi.org/10.1101/2023.01.04.522803 |
work_keys_str_mv | AT sanaullahahsan minimalpositionalsubstringcoverahaplotypethreadingalternativetolistephensmodel AT zhidegui minimalpositionalsubstringcoverahaplotypethreadingalternativetolistephensmodel AT zhangshaojie minimalpositionalsubstringcoverahaplotypethreadingalternativetolistephensmodel |