
Phylogenomics with incomplete taxon coverage: the limits to inference

BACKGROUND: Phylogenomic studies based on multi-locus sequence data sets are usually characterized by partial taxon coverage, in which sequences for some loci are missing for some taxa. The impact of missing data has been widely studied in phylogenetics, but it has proven difficult to distinguish ef...


Bibliographic Details
Main authors: Sanderson, Michael J, McMahon, Michelle M, Steel, Mike
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2897806/
https://www.ncbi.nlm.nih.gov/pubmed/20500873
http://dx.doi.org/10.1186/1471-2148-10-155
author Sanderson, Michael J
McMahon, Michelle M
Steel, Mike
collection PubMed
description BACKGROUND: Phylogenomic studies based on multi-locus sequence data sets are usually characterized by partial taxon coverage, in which sequences for some loci are missing for some taxa. The impact of missing data has been widely studied in phylogenetics, but it has proven difficult to distinguish effects due to error in tree reconstruction from effects due to missing data per se. We approach this problem using an explicitly phylogenomic criterion of success, decisiveness, which refers to whether the pattern of taxon coverage allows for uniquely defining a single tree for all taxa. RESULTS: We establish theoretical bounds on the impact of missing data on decisiveness. Results are derived for two contexts: a fixed taxon coverage pattern, such as that observed from an already assembled data set, and a randomly generated pattern derived from a process of sampling new data, such as might be observed in an ongoing comparative genomics sequencing project. Lower bounds on how many loci are needed for decisiveness are derived for the former case, and both lower and upper bounds for the latter. When data are not decisive for all trees, we estimate the probability of decisiveness and the chances that a given edge in the tree will be distinguishable. Theoretical results are illustrated using several empirical examples constructed by mining sequence databases, genomic libraries such as ESTs and BACs, and complete genome sequences. CONCLUSION: Partial taxon coverage among loci can limit phylogenomic inference by making it impossible to distinguish among multiple alternative trees. However, even though lack of decisiveness is typical of many sparse phylogenomic data sets, it is often still possible to distinguish a large fraction of edges in the tree.
format Text
id pubmed-2897806
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2897806 2010-07-07 Phylogenomics with incomplete taxon coverage: the limits to inference Sanderson, Michael J McMahon, Michelle M Steel, Mike BMC Evol Biol Research article
BioMed Central 2010-05-25 /pmc/articles/PMC2897806/ /pubmed/20500873 http://dx.doi.org/10.1186/1471-2148-10-155 Text en Copyright ©2010 Sanderson et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Phylogenomics with incomplete taxon coverage: the limits to inference
topic Research article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2897806/
https://www.ncbi.nlm.nih.gov/pubmed/20500873
http://dx.doi.org/10.1186/1471-2148-10-155