Cargando…

Better ILP models for haplotype assembly

BACKGROUND: The haplotype assembly problem for diploid is to find a pair of haplotypes from a given set of aligned Single Nucleotide Polymorphism (SNP) fragments (reads). It has many applications in association studies, drug design, and genetic research. Since this problem is computationally hard, b...

Descripción completa

Detalles Bibliográficos
Autores principales: Etemadi, Maryam, Bagherian, Mehri, Chen, Zhi-Zhong, Wang, Lusheng
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2018
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6389102/
https://www.ncbi.nlm.nih.gov/pubmed/29504891
http://dx.doi.org/10.1186/s12859-018-2012-x
_version_ 1783397888146014208
author Etemadi, Maryam
Bagherian, Mehri
Chen, Zhi-Zhong
Wang, Lusheng
author_facet Etemadi, Maryam
Bagherian, Mehri
Chen, Zhi-Zhong
Wang, Lusheng
author_sort Etemadi, Maryam
collection PubMed
description BACKGROUND: The haplotype assembly problem for diploid is to find a pair of haplotypes from a given set of aligned Single Nucleotide Polymorphism (SNP) fragments (reads). It has many applications in association studies, drug design, and genetic research. Since this problem is computationally hard, both heuristic and exact algorithms have been designed for it. Although exact algorithms are much slower, they are still of great interest because they usually output significantly better solutions than heuristic algorithms in terms of popular measures such as the Minimum Error Correction (MEC) score, the number of switch errors, and the QAN50 score. Exact algorithms are also valuable because they can be used to witness how good a heuristic algorithm is. The best known exact algorithm is based on integer linear programming (ILP) and it is known that ILP can also be used to improve the output quality of every heuristic algorithm with a little decline in speed. Therefore, faster ILP models for the problem are highly demanded. RESULTS: As in previous studies, we consider not only the general case of the problem but also its all-heterozygous case where we assume that if a column of the input read matrix contains at least one 0 and one 1, then it corresponds to a heterozygous SNP site. For both cases, we design new ILP models for the haplotype assembly problem which aim at minimizing the MEC score. The new models are theoretically better because they contain significantly fewer constraints. More importantly, our experimental results show that for both simulated and real datasets, the new model for the all-heterozygous (respectively, general) case can usually be solved via CPLEX (an ILP solver) at least 5 times (respectively, twice) faster than the previous bests. Indeed, the running time can sometimes be 41 times better. CONCLUSIONS: This paper proposes a new ILP model for the haplotype assembly problem and its all-heterozygous case, respectively. Experiments with both real and simulated datasets show that the new models can be solved within much shorter time by CPLEX than the previous bests. We believe that the models can be used to improve heuristic algorithms as well.
format Online
Article
Text
id pubmed-6389102
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-63891022019-03-19 Better ILP models for haplotype assembly Etemadi, Maryam Bagherian, Mehri Chen, Zhi-Zhong Wang, Lusheng BMC Bioinformatics Methodology BACKGROUND: The haplotype assembly problem for diploid is to find a pair of haplotypes from a given set of aligned Single Nucleotide Polymorphism (SNP) fragments (reads). It has many applications in association studies, drug design, and genetic research. Since this problem is computationally hard, both heuristic and exact algorithms have been designed for it. Although exact algorithms are much slower, they are still of great interest because they usually output significantly better solutions than heuristic algorithms in terms of popular measures such as the Minimum Error Correction (MEC) score, the number of switch errors, and the QAN50 score. Exact algorithms are also valuable because they can be used to witness how good a heuristic algorithm is. The best known exact algorithm is based on integer linear programming (ILP) and it is known that ILP can also be used to improve the output quality of every heuristic algorithm with a little decline in speed. Therefore, faster ILP models for the problem are highly demanded. RESULTS: As in previous studies, we consider not only the general case of the problem but also its all-heterozygous case where we assume that if a column of the input read matrix contains at least one 0 and one 1, then it corresponds to a heterozygous SNP site. For both cases, we design new ILP models for the haplotype assembly problem which aim at minimizing the MEC score. The new models are theoretically better because they contain significantly fewer constraints. More importantly, our experimental results show that for both simulated and real datasets, the new model for the all-heterozygous (respectively, general) case can usually be solved via CPLEX (an ILP solver) at least 5 times (respectively, twice) faster than the previous bests. Indeed, the running time can sometimes be 41 times better. CONCLUSIONS: This paper proposes a new ILP model for the haplotype assembly problem and its all-heterozygous case, respectively. Experiments with both real and simulated datasets show that the new models can be solved within much shorter time by CPLEX than the previous bests. We believe that the models can be used to improve heuristic algorithms as well. BioMed Central 2018-02-19 /pmc/articles/PMC6389102/ /pubmed/29504891 http://dx.doi.org/10.1186/s12859-018-2012-x Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Methodology
Etemadi, Maryam
Bagherian, Mehri
Chen, Zhi-Zhong
Wang, Lusheng
Better ILP models for haplotype assembly
title Better ILP models for haplotype assembly
title_full Better ILP models for haplotype assembly
title_fullStr Better ILP models for haplotype assembly
title_full_unstemmed Better ILP models for haplotype assembly
title_short Better ILP models for haplotype assembly
title_sort better ilp models for haplotype assembly
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6389102/
https://www.ncbi.nlm.nih.gov/pubmed/29504891
http://dx.doi.org/10.1186/s12859-018-2012-x
work_keys_str_mv AT etemadimaryam betterilpmodelsforhaplotypeassembly
AT bagherianmehri betterilpmodelsforhaplotypeassembly
AT chenzhizhong betterilpmodelsforhaplotypeassembly
AT wanglusheng betterilpmodelsforhaplotypeassembly