Phylogenetic Profiling: How Much Input Data Is Enough?

Phylogenetic profiling is a well-established approach for predicting gene function based on patterns of gene presence and absence across species. Most recent developments have focused on methodological improvements, but relatively little is known about the effect of input data size on the qua...

Bibliographic Details
Main Authors: Škunca, Nives; Dessimoz, Christophe
Format: Online Article Text
Language: English
Published: Public Library of Science 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4332489/
https://www.ncbi.nlm.nih.gov/pubmed/25679783
http://dx.doi.org/10.1371/journal.pone.0114701
author Škunca, Nives
Dessimoz, Christophe
collection PubMed
description Phylogenetic profiling is a well-established approach for predicting gene function based on patterns of gene presence and absence across species. Most recent developments have focused on methodological improvements, but relatively little is known about the effect of input data size on the quality of predictions. In this work, we ask: how many genomes and functional annotations need to be considered for phylogenetic profiling to be effective? Phylogenetic profiling generally benefits from an increased amount of input data. However, by decomposing this improvement in predictive accuracy in terms of the contribution of additional genomes and of additional annotations, we observed diminishing returns in adding more than ∼100 genomes, whereas increasing the number of annotations remained strongly beneficial throughout. We also observed that maximising phylogenetic diversity within a clade of interest improves predictive accuracy, but the effect is small compared to changes in the number of genomes under comparison. Finally, we show that these findings are supported in light of the Open World Assumption, which posits that functional annotation databases are inherently incomplete. All the tools and data used in this work are available for reuse from http://lab.dessimoz.org/14_phylprof. Scripts used to analyse the data are available on request from the authors.
format Online
Article
Text
id pubmed-4332489
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4332489 2015-02-24 Phylogenetic Profiling: How Much Input Data Is Enough? Škunca, Nives; Dessimoz, Christophe. PLoS One. Research Article. Public Library of Science 2015-02-13 /pmc/articles/PMC4332489/ /pubmed/25679783 http://dx.doi.org/10.1371/journal.pone.0114701 Text en © 2015 Škunca, Dessimoz http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Phylogenetic Profiling: How Much Input Data Is Enough?
topic Research Article