Cargando…

An ILP solution for the gene duplication problem

BACKGROUND: The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. Ho...

Descripción completa

Detalles Bibliográficos
Autores principales: Chang, Wen-Chieh, Burleigh, Gordon J, Fernández-Baca, David F, Eulenstein, Oliver
Formato: Texto
Lenguaje:English
Publicado: BioMed Central 2011
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3044268/
https://www.ncbi.nlm.nih.gov/pubmed/21342543
http://dx.doi.org/10.1186/1471-2105-12-S1-S14
_version_ 1782198706921013248
author Chang, Wen-Chieh
Burleigh, Gordon J
Fernández-Baca, David F
Eulenstein, Oliver
author_facet Chang, Wen-Chieh
Burleigh, Gordon J
Fernández-Baca, David F
Eulenstein, Oliver
author_sort Chang, Wen-Chieh
collection PubMed
description BACKGROUND: The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore, most analyses use heuristics that lack any performance guarantee. RESULTS: We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6, 084-gene data set. The unique, optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics. CONCLUSIONS: Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1, 000 genes in a few hours. These are the largest instances that have been solved to optimally to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.
format Text
id pubmed-3044268
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-30442682011-02-25 An ILP solution for the gene duplication problem Chang, Wen-Chieh Burleigh, Gordon J Fernández-Baca, David F Eulenstein, Oliver BMC Bioinformatics Research BACKGROUND: The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore, most analyses use heuristics that lack any performance guarantee. RESULTS: We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6, 084-gene data set. The unique, optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics. CONCLUSIONS: Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1, 000 genes in a few hours. These are the largest instances that have been solved to optimally to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates. BioMed Central 2011-02-15 /pmc/articles/PMC3044268/ /pubmed/21342543 http://dx.doi.org/10.1186/1471-2105-12-S1-S14 Text en Copyright ©2011 Chang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Chang, Wen-Chieh
Burleigh, Gordon J
Fernández-Baca, David F
Eulenstein, Oliver
An ILP solution for the gene duplication problem
title An ILP solution for the gene duplication problem
title_full An ILP solution for the gene duplication problem
title_fullStr An ILP solution for the gene duplication problem
title_full_unstemmed An ILP solution for the gene duplication problem
title_short An ILP solution for the gene duplication problem
title_sort ilp solution for the gene duplication problem
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3044268/
https://www.ncbi.nlm.nih.gov/pubmed/21342543
http://dx.doi.org/10.1186/1471-2105-12-S1-S14
work_keys_str_mv AT changwenchieh anilpsolutionforthegeneduplicationproblem
AT burleighgordonj anilpsolutionforthegeneduplicationproblem
AT fernandezbacadavidf anilpsolutionforthegeneduplicationproblem
AT eulensteinoliver anilpsolutionforthegeneduplicationproblem
AT changwenchieh ilpsolutionforthegeneduplicationproblem
AT burleighgordonj ilpsolutionforthegeneduplicationproblem
AT fernandezbacadavidf ilpsolutionforthegeneduplicationproblem
AT eulensteinoliver ilpsolutionforthegeneduplicationproblem