Homology-Aware Phylogenomics at Gigabase Scales
Obstacles to inferring species trees from whole genome data sets range from algorithmic and data management challenges to the wholesale discordance in evolutionary history found in different parts of a genome. Recent work that builds trees directly from genomes by parsing them into sets of small [Fo...
Main authors: | Sanderson, M. J.; Nicolae, Marius; McMahon, M. M. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2017 |
Subjects: | Regular Articles |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5790135/ https://www.ncbi.nlm.nih.gov/pubmed/28123115 http://dx.doi.org/10.1093/sysbio/syw104 |
_version_ | 1783296407425253376 |
---|---|
author | Sanderson, M. J.; Nicolae, Marius; McMahon, M. M. |
author_facet | Sanderson, M. J.; Nicolae, Marius; McMahon, M. M. |
author_sort | Sanderson, M. J. |
collection | PubMed |
description | Obstacles to inferring species trees from whole genome data sets range from algorithmic and data management challenges to the wholesale discordance in evolutionary history found in different parts of a genome. Recent work that builds trees directly from genomes by parsing them into sets of small k-mer strings holds promise to streamline and simplify these efforts, but existing approaches do not account well for gene tree discordance. We describe a “seed and extend” protocol that finds nearly exact matching sets of orthologous k-mers and extends them to construct data sets that can properly account for genomic heterogeneity. Exploiting an efficient suffix array data structure, sets of whole genomes can be parsed and converted into phylogenetic data matrices rapidly, with contiguous blocks of k-mers from the same chromosome, gene, or scaffold concatenated as needed. Phylogenetic trees constructed from highly curated rice genome data and a diverse set of six other eukaryotic whole genome, transcriptome, and organellar genome data sets recovered trees nearly identical to published phylogenomic analyses, in a small fraction of the time, and requiring many fewer parameter choices. Our method’s ability to retain local homology information was demonstrated by using it to characterize gene tree discordance across the rice genome, and by its robustness to the high rate of interchromosomal gene transfer found in several rice species. |
format | Online Article Text |
id | pubmed-5790135 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-5790135 2018-02-05 Homology-Aware Phylogenomics at Gigabase Scales Sanderson, M. J.; Nicolae, Marius; McMahon, M. M. Syst Biol Regular Articles Obstacles to inferring species trees from whole genome data sets range from algorithmic and data management challenges to the wholesale discordance in evolutionary history found in different parts of a genome. Recent work that builds trees directly from genomes by parsing them into sets of small k-mer strings holds promise to streamline and simplify these efforts, but existing approaches do not account well for gene tree discordance. We describe a “seed and extend” protocol that finds nearly exact matching sets of orthologous k-mers and extends them to construct data sets that can properly account for genomic heterogeneity. Exploiting an efficient suffix array data structure, sets of whole genomes can be parsed and converted into phylogenetic data matrices rapidly, with contiguous blocks of k-mers from the same chromosome, gene, or scaffold concatenated as needed. Phylogenetic trees constructed from highly curated rice genome data and a diverse set of six other eukaryotic whole genome, transcriptome, and organellar genome data sets recovered trees nearly identical to published phylogenomic analyses, in a small fraction of the time, and requiring many fewer parameter choices. Our method’s ability to retain local homology information was demonstrated by using it to characterize gene tree discordance across the rice genome, and by its robustness to the high rate of interchromosomal gene transfer found in several rice species. Oxford University Press 2017-07 2017-01-25 /pmc/articles/PMC5790135/ /pubmed/28123115 http://dx.doi.org/10.1093/sysbio/syw104 Text en © The Author(s) 2017. Published by Oxford University Press, on behalf of the Society of Systematic Biologists.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Regular Articles Sanderson, M. J. Nicolae, Marius McMahon, M. M. Homology-Aware Phylogenomics at Gigabase Scales |
title | Homology-Aware Phylogenomics at Gigabase Scales |
title_full | Homology-Aware Phylogenomics at Gigabase Scales |
title_fullStr | Homology-Aware Phylogenomics at Gigabase Scales |
title_full_unstemmed | Homology-Aware Phylogenomics at Gigabase Scales |
title_short | Homology-Aware Phylogenomics at Gigabase Scales |
title_sort | homology-aware phylogenomics at gigabase scales |
topic | Regular Articles |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5790135/ https://www.ncbi.nlm.nih.gov/pubmed/28123115 http://dx.doi.org/10.1093/sysbio/syw104 |
work_keys_str_mv | AT sandersonmj homologyawarephylogenomicsatgigabasescales AT nicolaemarius homologyawarephylogenomicsatgigabasescales AT mcmahonmm homologyawarephylogenomicsatgigabasescales |
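
The abstract describes a "seed and extend" protocol in which a suffix array is used to find matching k-mers across genomes, and contiguous blocks of k-mers are then concatenated. The following is a minimal sketch of that general idea, not the authors' implementation. It makes several simplifying assumptions that are not in the record: matches are exact (the paper speaks of nearly exact matches), a "seed" is a k-mer occurring exactly once in every genome, the suffix array is built naively, and the function names `build_suffix_array`, `shared_single_copy_kmers`, and `merge_into_blocks` are hypothetical.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix. Toy inputs only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def shared_single_copy_kmers(genomes, k):
    """Return {kmer: {genome_name: offset}} for k-mers found exactly once
    in every genome (an assumed, simplified stand-in for orthologous seeds)."""
    names = list(genomes)
    text = ""
    owner = []  # owner[i] = (genome name, offset in that genome) for text[i]
    for name in names:
        seq = genomes[name]
        owner.extend((name, j) for j in range(len(seq)))
        owner.append((None, -1))  # slot for the separator character
        text += seq + "#"         # '#' keeps k-mers from spanning two genomes
    sa = build_suffix_array(text)

    seeds = {}
    i, n = 0, len(sa)
    while i < n:
        kmer = text[sa[i]:sa[i] + k]
        if len(kmer) < k or "#" in kmer:
            i += 1
            continue
        # Suffixes sharing the same k-length prefix are adjacent in the array.
        j = i
        while j < n and text[sa[j]:sa[j] + k] == kmer:
            j += 1
        hits = [owner[sa[p]] for p in range(i, j)]
        if len(hits) == len(names) and {h[0] for h in hits} == set(names):
            seeds[kmer] = dict(hits)
        i = j
    return seeds


def merge_into_blocks(seeds, k):
    """Extend step: merge seeds whose offsets advance by one in every genome.
    Returns a list of (start_offsets_by_genome, block_length)."""
    if not seeds:
        return []
    names = sorted(next(iter(seeds.values())))
    ordered = sorted(tuple(pos[name] for name in names) for pos in seeds.values())
    blocks = []
    for pos in ordered:
        if blocks:
            start, length = blocks[-1]
            # Where the next overlapping seed must begin in each genome.
            expected = tuple(s + length - k + 1 for s in start)
            if pos == expected:
                blocks[-1] = (start, length + 1)
                continue
        blocks.append((pos, k))
    return [(dict(zip(names, start)), length) for start, length in blocks]


# Toy usage with made-up sequences (not data from the paper).
toy_genomes = {"A": "ACGTACGGT", "B": "TTACGGTA", "C": "GACGGTC"}
toy_seeds = shared_single_copy_kmers(toy_genomes, k=4)
toy_blocks = merge_into_blocks(toy_seeds, k=4)
```

Each block records where a shared region starts in every genome and how long it is, so the corresponding substrings could be sliced out as aligned columns of a data matrix. A production tool would need a linear-time suffix array construction and tolerance for mismatches, both of which this sketch omits.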