
Homology-Aware Phylogenomics at Gigabase Scales


Bibliographic Details
Main authors: Sanderson, M. J., Nicolae, Marius, McMahon, M. M.
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5790135/
https://www.ncbi.nlm.nih.gov/pubmed/28123115
http://dx.doi.org/10.1093/sysbio/syw104
author Sanderson, M. J.
Nicolae, Marius
McMahon, M. M.
collection PubMed
description Obstacles to inferring species trees from whole genome data sets range from algorithmic and data management challenges to the wholesale discordance in evolutionary history found in different parts of a genome. Recent work that builds trees directly from genomes by parsing them into sets of small k-mer strings holds promise to streamline and simplify these efforts, but existing approaches do not account well for gene tree discordance. We describe a “seed and extend” protocol that finds nearly exact matching sets of orthologous k-mers and extends them to construct data sets that can properly account for genomic heterogeneity. Exploiting an efficient suffix array data structure, sets of whole genomes can be parsed and converted into phylogenetic data matrices rapidly, with contiguous blocks of k-mers from the same chromosome, gene, or scaffold concatenated as needed. Phylogenetic trees constructed from highly curated rice genome data and a diverse set of six other eukaryotic whole genome, transcriptome, and organellar genome data sets recovered trees nearly identical to published phylogenomic analyses, in a small fraction of the time, and requiring many fewer parameter choices. Our method’s ability to retain local homology information was demonstrated by using it to characterize gene tree discordance across the rice genome, and by its robustness to the high rate of interchromosomal gene transfer found in several rice species.
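The seeding idea in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the suffix array is built naively (a stand-in for the efficient structure the abstract mentions), only exact matches are found, and the extension step is not shown. It assumes a seed is a k-mer that occurs exactly once in every genome.

```python
from itertools import groupby


def build_suffix_array(text):
    """Naive suffix array: suffix start positions in sorted order (toy inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def shared_single_copy_kmers(genomes, k):
    """Map each k-mer found exactly once in every genome to its position per genome."""
    # Concatenate the genomes, ending each with '#', which cannot occur in DNA,
    # and remember which genome owns every position of the joined text.
    text, owner, offset = "", [], []
    for g_index, seq in enumerate(genomes):
        offset.append(len(text))
        text += seq + "#"
        owner.extend([g_index] * (len(seq) + 1))

    sa = build_suffix_array(text)
    # Drop suffixes whose first k characters run into a separator; because the
    # text ends with '#', this also drops suffixes shorter than k.
    starts = [i for i in sa if "#" not in text[i:i + k]]

    seeds = {}
    # Suffixes that share a k-length prefix are adjacent in suffix-array order.
    for kmer, group in groupby(starts, key=lambda i: text[i:i + k]):
        hits = sorted(group, key=lambda i: owner[i])
        # Assumed seed criterion: one occurrence in each genome, no repeats.
        if [owner[i] for i in hits] == list(range(len(genomes))):
            seeds[kmer] = [i - offset[owner[i]] for i in hits]
    return seeds


if __name__ == "__main__":
    print(shared_single_copy_kmers(["ACGTAC", "TTACGT", "GACGTT"], 4))
```

Requiring one hit per genome is what excludes repeated k-mers, which would otherwise mix paralogous copies into a single alignment column.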
format Online
Article
Text
id pubmed-5790135
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5790135 2018-02-05 Homology-Aware Phylogenomics at Gigabase Scales Sanderson, M. J. Nicolae, Marius McMahon, M. M. Syst Biol Regular Articles Oxford University Press 2017-07 2017-01-25 /pmc/articles/PMC5790135/ /pubmed/28123115 http://dx.doi.org/10.1093/sysbio/syw104 Text en © The Author(s) 2017. Published by Oxford University Press, on behalf of the Society of Systematic Biologists.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Homology-Aware Phylogenomics at Gigabase Scales
topic Regular Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5790135/
https://www.ncbi.nlm.nih.gov/pubmed/28123115
http://dx.doi.org/10.1093/sysbio/syw104