Inadvertent Paralog Inclusion Drives Artifactual Topologies and Timetree Estimates in Phylogenomics

Increasingly, large phylogenomic data sets include transcriptomic data from nonmodel organisms. This has allowed controversial and unexplored evolutionary relationships in the tree of life to be addressed, but it also increases the risk of inadvertently including paralogs in the analysis. Alth...

Bibliographic Details
Main authors: Siu-Ting, Karen, Torres-Sánchez, María, San Mauro, Diego, Wilcockson, David, Wilkinson, Mark, Pisani, Davide, O’Connell, Mary J, Creevey, Christopher J
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6526904/
https://www.ncbi.nlm.nih.gov/pubmed/30903171
http://dx.doi.org/10.1093/molbev/msz067
author Siu-Ting, Karen
Torres-Sánchez, María
San Mauro, Diego
Wilcockson, David
Wilkinson, Mark
Pisani, Davide
O’Connell, Mary J
Creevey, Christopher J
author_sort Siu-Ting, Karen
collection PubMed
description Increasingly, large phylogenomic data sets include transcriptomic data from nonmodel organisms. This has allowed controversial and unexplored evolutionary relationships in the tree of life to be addressed, but it also increases the risk of inadvertently including paralogs in the analysis. Although this may be expected to result in decreased phylogenetic support, it is not clear whether it could also drive highly supported artifactual relationships. Many groups, including the hyperdiverse Lissamphibia, are especially susceptible to these issues because of ancient gene duplication events, the small number of sequenced genomes, and the increasing use of transcriptomes to resolve historically conflicting taxonomic hypotheses. We tested the potential impact of paralog inclusion on the topologies and timetree estimates of the Lissamphibia using published and de novo sequencing data including 18 amphibian species, from which 2,656 single-copy gene families were identified. A novel paralog filtering approach resulted in four differently curated data sets, which were used for phylogenetic reconstructions using Bayesian inference, maximum likelihood, and quartet-based supertrees. We found that paralogs drive strongly supported conflicting hypotheses within the Lissamphibia (Batrachia and Procera) and older divergence time estimates, even within groups where no variation in topology was observed. All investigated methods, except Bayesian inference with the CAT-GTR model, were found to be sensitive to paralogs, but with filtering, convergence to the same answer (Batrachia) was observed. This is the first large-scale study to address the impact of orthology selection using transcriptomic data, and it emphasizes the importance of quality over quantity, particularly for understanding relationships of poorly sampled taxa.
format Online
Article
Text
id pubmed-6526904
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6526904 2019-05-28 Inadvertent Paralog Inclusion Drives Artifactual Topologies and Timetree Estimates in Phylogenomics Siu-Ting, Karen Torres-Sánchez, María San Mauro, Diego Wilcockson, David Wilkinson, Mark Pisani, Davide O’Connell, Mary J Creevey, Christopher J Mol Biol Evol Methods Oxford University Press 2019-06 2019-03-23 /pmc/articles/PMC6526904/ /pubmed/30903171 http://dx.doi.org/10.1093/molbev/msz067 Text en © The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Inadvertent Paralog Inclusion Drives Artifactual Topologies and Timetree Estimates in Phylogenomics
topic Methods