Cargando…

Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)

Motivation: Over the last few years, methods based on suffix arrays using the Burrows–Wheeler Transform have been widely used for DNA sequence read matching and assembly. These provide very fast search algorithms, linear in the search pattern size, on a highly compressible representation of the data...

Descripción completa

Detalles Bibliográficos
Autor principal:	Durbin, Richard
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2014
Materias:	Original Papers
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3998136/ https://www.ncbi.nlm.nih.gov/pubmed/24413527 http://dx.doi.org/10.1093/bioinformatics/btu014

_version_	1782313300744208384
author	Durbin, Richard
author_facet	Durbin, Richard
author_sort	Durbin, Richard
collection	PubMed
description	Motivation: Over the last few years, methods based on suffix arrays using the Burrows–Wheeler Transform have been widely used for DNA sequence read matching and assembly. These provide very fast search algorithms, linear in the search pattern size, on a highly compressible representation of the dataset being searched. Meanwhile, algorithmic development for genotype data has concentrated on statistical methods for phasing and imputation, based on probabilistic matching to hidden Markov model representations of the reference data, which while powerful are much less computationally efficient. Here a theory of haplotype matching using suffix array ideas is developed, which should scale too much larger datasets than those currently handled by genotype algorithms. Results: Given M sequences with N bi-allelic variable sites, an O(NM) algorithm to derive a representation of the data based on positional prefix arrays is given, which is termed the positional Burrows–Wheeler transform (PBWT). On large datasets this compresses with run-length encoding by more than a factor of a hundred smaller than using gzip on the raw data. Using this representation a method is given to find all maximal haplotype matches within the set in O(NM) time rather than O(NM(2)) as expected from naive pairwise comparison, and also a fast algorithm, empirically independent of M given sufficient memory for indexes, to find maximal matches between a new sequence and the set. The discussion includes some proposals about how these approaches could be used for imputation and phasing. Availability: http://github.com/richarddurbin/pbwt Contact: richard.durbin@sanger.ac.uk
format	Online Article Text
id	pubmed-3998136
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-39981362014-04-24 Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT) Durbin, Richard Bioinformatics Original Papers Motivation: Over the last few years, methods based on suffix arrays using the Burrows–Wheeler Transform have been widely used for DNA sequence read matching and assembly. These provide very fast search algorithms, linear in the search pattern size, on a highly compressible representation of the dataset being searched. Meanwhile, algorithmic development for genotype data has concentrated on statistical methods for phasing and imputation, based on probabilistic matching to hidden Markov model representations of the reference data, which while powerful are much less computationally efficient. Here a theory of haplotype matching using suffix array ideas is developed, which should scale too much larger datasets than those currently handled by genotype algorithms. Results: Given M sequences with N bi-allelic variable sites, an O(NM) algorithm to derive a representation of the data based on positional prefix arrays is given, which is termed the positional Burrows–Wheeler transform (PBWT). On large datasets this compresses with run-length encoding by more than a factor of a hundred smaller than using gzip on the raw data. Using this representation a method is given to find all maximal haplotype matches within the set in O(NM) time rather than O(NM(2)) as expected from naive pairwise comparison, and also a fast algorithm, empirically independent of M given sufficient memory for indexes, to find maximal matches between a new sequence and the set. The discussion includes some proposals about how these approaches could be used for imputation and phasing. Availability: http://github.com/richarddurbin/pbwt Contact: richard.durbin@sanger.ac.uk Oxford University Press 2014-05-01 2014-01-09 /pmc/articles/PMC3998136/ /pubmed/24413527 http://dx.doi.org/10.1093/bioinformatics/btu014 Text en © The Author 2014. Published by Oxford University Press. http://creativecommons.org/licenses/by/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Papers Durbin, Richard Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)
title	Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)
title_full	Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)
title_fullStr	Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)
title_full_unstemmed	Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)
title_short	Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)
title_sort	efficient haplotype matching and storage using the positional burrows–wheeler transform (pbwt)
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3998136/ https://www.ncbi.nlm.nih.gov/pubmed/24413527 http://dx.doi.org/10.1093/bioinformatics/btu014
work_keys_str_mv	AT durbinrichard efficienthaplotypematchingandstorageusingthepositionalburrowswheelertransformpbwt

Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)

Ejemplares similares