Cargando…

Optimal algorithms for haplotype assembly from whole-genome sequence data

Motivation: Haplotype inference is an important step for many types of analyses of genetic variation in the human genome. Traditional approaches for obtaining haplotypes involve collecting genotype information from a population of individuals and then applying a haplotype inference algorithm. The de...

Descripción completa

Detalles Bibliográficos
Autores principales:	He, Dan, Choi, Arthur, Pipatsrisawat, Knot, Darwiche, Adnan, Eskin, Eleazar
Formato:	Texto
Lenguaje:	English
Publicado:	Oxford University Press 2010
Materias:	Ismb 2010 Conference Proceedings July 11 to July 13, 2010, Boston, Ma, Usa
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881399/ https://www.ncbi.nlm.nih.gov/pubmed/20529904 http://dx.doi.org/10.1093/bioinformatics/btq215

_version_	1782182115406774272
author	He, Dan Choi, Arthur Pipatsrisawat, Knot Darwiche, Adnan Eskin, Eleazar
author_facet	He, Dan Choi, Arthur Pipatsrisawat, Knot Darwiche, Adnan Eskin, Eleazar
author_sort	He, Dan
collection	PubMed
description	Motivation: Haplotype inference is an important step for many types of analyses of genetic variation in the human genome. Traditional approaches for obtaining haplotypes involve collecting genotype information from a population of individuals and then applying a haplotype inference algorithm. The development of high-throughput sequencing technologies allows for an alternative strategy to obtain haplotypes by combining sequence fragments. The problem of ‘haplotype assembly’ is the problem of assembling the two haplotypes for a chromosome given the collection of such fragments, or reads, and their locations in the haplotypes, which are pre-determined by mapping the reads to a reference genome. Errors in reads significantly increase the difficulty of the problem and it has been shown that the problem is NP-hard even for reads of length 2. Existing greedy and stochastic algorithms are not guaranteed to find the optimal solutions for the haplotype assembly problem. Results: In this article, we proposed a dynamic programming algorithm that is able to assemble the haplotypes optimally with time complexity O(m × 2(k) × n), where m is the number of reads, k is the length of the longest read and n is the total number of SNPs in the haplotypes. We also reduce the haplotype assembly problem into the maximum satisfiability problem that can often be solved optimally even when k is large. Taking advantage of the efficiency of our algorithm, we perform simulation experiments demonstrating that the assembly of haplotypes using reads of length typical of the current sequencing technologies is not practical. However, we demonstrate that the combination of this approach and the traditional haplotype phasing approaches allow us to practically construct haplotypes containing both common and rare variants. Contact: danhe@cs.ucla.edu
format	Text
id	pubmed-2881399
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-28813992010-06-08 Optimal algorithms for haplotype assembly from whole-genome sequence data He, Dan Choi, Arthur Pipatsrisawat, Knot Darwiche, Adnan Eskin, Eleazar Bioinformatics Ismb 2010 Conference Proceedings July 11 to July 13, 2010, Boston, Ma, Usa Motivation: Haplotype inference is an important step for many types of analyses of genetic variation in the human genome. Traditional approaches for obtaining haplotypes involve collecting genotype information from a population of individuals and then applying a haplotype inference algorithm. The development of high-throughput sequencing technologies allows for an alternative strategy to obtain haplotypes by combining sequence fragments. The problem of ‘haplotype assembly’ is the problem of assembling the two haplotypes for a chromosome given the collection of such fragments, or reads, and their locations in the haplotypes, which are pre-determined by mapping the reads to a reference genome. Errors in reads significantly increase the difficulty of the problem and it has been shown that the problem is NP-hard even for reads of length 2. Existing greedy and stochastic algorithms are not guaranteed to find the optimal solutions for the haplotype assembly problem. Results: In this article, we proposed a dynamic programming algorithm that is able to assemble the haplotypes optimally with time complexity O(m × 2(k) × n), where m is the number of reads, k is the length of the longest read and n is the total number of SNPs in the haplotypes. We also reduce the haplotype assembly problem into the maximum satisfiability problem that can often be solved optimally even when k is large. Taking advantage of the efficiency of our algorithm, we perform simulation experiments demonstrating that the assembly of haplotypes using reads of length typical of the current sequencing technologies is not practical. However, we demonstrate that the combination of this approach and the traditional haplotype phasing approaches allow us to practically construct haplotypes containing both common and rare variants. Contact: danhe@cs.ucla.edu Oxford University Press 2010-06-15 2010-06-01 /pmc/articles/PMC2881399/ /pubmed/20529904 http://dx.doi.org/10.1093/bioinformatics/btq215 Text en © The Author(s) 2010. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Ismb 2010 Conference Proceedings July 11 to July 13, 2010, Boston, Ma, Usa He, Dan Choi, Arthur Pipatsrisawat, Knot Darwiche, Adnan Eskin, Eleazar Optimal algorithms for haplotype assembly from whole-genome sequence data
title	Optimal algorithms for haplotype assembly from whole-genome sequence data
title_full	Optimal algorithms for haplotype assembly from whole-genome sequence data
title_fullStr	Optimal algorithms for haplotype assembly from whole-genome sequence data
title_full_unstemmed	Optimal algorithms for haplotype assembly from whole-genome sequence data
title_short	Optimal algorithms for haplotype assembly from whole-genome sequence data
title_sort	optimal algorithms for haplotype assembly from whole-genome sequence data
topic	Ismb 2010 Conference Proceedings July 11 to July 13, 2010, Boston, Ma, Usa
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881399/ https://www.ncbi.nlm.nih.gov/pubmed/20529904 http://dx.doi.org/10.1093/bioinformatics/btq215
work_keys_str_mv	AT hedan optimalalgorithmsforhaplotypeassemblyfromwholegenomesequencedata AT choiarthur optimalalgorithmsforhaplotypeassemblyfromwholegenomesequencedata AT pipatsrisawatknot optimalalgorithmsforhaplotypeassemblyfromwholegenomesequencedata AT darwicheadnan optimalalgorithmsforhaplotypeassemblyfromwholegenomesequencedata AT eskineleazar optimalalgorithmsforhaplotypeassemblyfromwholegenomesequencedata

Optimal algorithms for haplotype assembly from whole-genome sequence data

Ejemplares similares