
Selecting informative subsets of sparse supermatrices increases the chance to find correct trees

BACKGROUND: Character matrices with extensive missing data are frequently used in phylogenomics with potentially detrimental effects on the accuracy and robustness of tree inference. Therefore, many investigators select taxa and genes with high data coverage. Drawbacks of these selections are their...


Bibliographic Details
Main Authors:	Misof, Bernhard, Meyer, Benjamin, von Reumont, Björn Marcus, Kück, Patrick, Misof, Katharina, Meusemann, Karen
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3890606/ https://www.ncbi.nlm.nih.gov/pubmed/24299043 http://dx.doi.org/10.1186/1471-2105-14-348

_version_	1782299285721710592
author	Misof, Bernhard Meyer, Benjamin von Reumont, Björn Marcus Kück, Patrick Misof, Katharina Meusemann, Karen
author_sort	Misof, Bernhard
collection	PubMed
description	BACKGROUND: Character matrices with extensive missing data are frequently used in phylogenomics with potentially detrimental effects on the accuracy and robustness of tree inference. Therefore, many investigators select taxa and genes with high data coverage. Drawbacks of these selections are their exclusive reliance on data coverage without consideration of actual signal in the data which might, thus, not deliver optimal data matrices in terms of potential phylogenetic signal. In order to circumvent this problem, we have developed a heuristic implemented in a software called mare which (1) assesses information content of genes in supermatrices using a measure of potential signal combined with data coverage and (2) reduces supermatrices with a simple hill climbing procedure to submatrices with high total information content. We conducted simulation studies using matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10–30%. RESULTS: With matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10–30%, Maximum Likelihood (ML) tree reconstructions failed to recover correct trees. A selection of a data subset with the herein proposed approach increased the chance to recover correct partial trees more than 10-fold. The selection of data subsets with the herein proposed simple hill climbing procedure performed well either considering the information content or just a simple presence/absence information of genes. We also applied our approach to an empirical data set, addressing questions of vertebrate systematics. With this empirical dataset, selecting a data subset with high information content and supporting a tree with high average bootstrap support was most successful if information content of genes was considered.
CONCLUSIONS: Our analyses of simulated and empirical data demonstrate that sparse supermatrices can be reduced on a formal basis outperforming the usually used simple selections of taxa and genes with high data coverage.
format	Online Article Text
id	pubmed-3890606
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3890606 2014-01-23 BMC Bioinformatics Methodology Article BioMed Central 2013-12-03 /pmc/articles/PMC3890606/ /pubmed/24299043 http://dx.doi.org/10.1186/1471-2105-14-348 Text en Copyright © 2013 Misof et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Selecting informative subsets of sparse supermatrices increases the chance to find correct trees
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3890606/ https://www.ncbi.nlm.nih.gov/pubmed/24299043 http://dx.doi.org/10.1186/1471-2105-14-348


Similar Items