LS(X): automated reduction of gene-specific lineage evolutionary rate heterogeneity for multi-gene phylogeny inference
Main authors: | , |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6693147/ https://www.ncbi.nlm.nih.gov/pubmed/31409290 http://dx.doi.org/10.1186/s12859-019-3020-1 |
Summary: | BACKGROUND: Lineage rate heterogeneity can be a major source of bias, especially in multi-gene phylogeny inference. We had previously tackled this issue by developing LS(3), a data subselection algorithm that, by removing fast-evolving sequences in a gene-specific manner, identifies subsets of sequences that evolve at a relatively homogeneous rate. However, this algorithm had two major shortcomings: (i) it was automated and published as a set of bash scripts, and hence was Linux-specific, and not user friendly, and (ii) it could result in very stringent sequence subselection when extremely slow-evolving sequences were present. RESULTS: We address these challenges and produce a new, platform-independent program, LS(X), written in R, which includes a reprogrammed version of the original LS(3) algorithm and has added features to make better lineage rate calculations. In addition, we developed and included an alternative version of the algorithm, LS(4), which reduces lineage rate heterogeneity by detecting sequences that evolve too fast and sequences that evolve too slow, resulting in less stringent data subselection when extremely slow-evolving sequences are present. The efficiency of LS(X) and of LS(4) with datasets with extremely slow-evolving sequences is demonstrated with simulated data, and by the resolution of a contentious node in the catfish phylogeny that was affected by an unusually high lineage rate heterogeneity in the dataset. CONCLUSIONS: LS(X) is a new bioinformatic tool, with an accessible code, and with which the effect of lineage rate heterogeneity can be explored in gene sequence datasets of virtually any size. In addition, the two modalities of the sequence subsampling algorithm included, LS(3) and LS(4), allow the user to optimize the amount of non-phylogenetic signal removed while keeping a maximum of phylogenetic signal.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-3020-1) contains supplementary material, which is available to authorized users. |
---|
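
Illustrative sketch (not part of the catalog record or of LS(X) itself): the summary contrasts removing only fast-evolving sequences with removing sequences that are either too fast or too slow. The Python toy below shows that contrast under loud assumptions. The function `subselect`, its parameters, the mean-rate ratio threshold `max_ratio`, and the "furthest from the median" rule are all hypothetical stand-ins chosen for brevity; the published program is written in R and its actual homogeneity test and removal criteria are not reproduced here.

```python
from statistics import mean, median


def subselect(rates, lineage_of, mode="fast_only", max_ratio=1.5):
    """Toy per-gene sequence subselection (hypothetical, not the LS(X) code).

    rates      -- dict: sequence name -> rate proxy (e.g. a branch-length sum)
    lineage_of -- dict: sequence name -> lineage label
    mode       -- "fast_only" drops the fastest sequence of the fastest lineage;
                  "both_tails" drops the sequence furthest from the median rate
    Returns (kept, removed, homogeneous).
    """
    kept = dict(rates)
    removed = []
    while True:
        # Group the remaining sequences by lineage and compute mean rates.
        groups = {}
        for seq in kept:
            groups.setdefault(lineage_of[seq], []).append(seq)
        means = {lin: mean([kept[s] for s in seqs]) for lin, seqs in groups.items()}

        # Assumed homogeneity criterion: a simple ratio of lineage mean rates.
        if max(means.values()) <= max_ratio * min(means.values()):
            return kept, removed, True

        if mode == "fast_only":
            fastest = max(means, key=means.get)
            candidate = max(groups[fastest], key=kept.get)
        else:
            centre = median(kept.values())
            candidate = max(kept, key=lambda s: abs(kept[s] - centre))

        # Never empty a lineage; give up and report the gene as heterogeneous.
        if len(groups[lineage_of[candidate]]) == 1:
            return kept, removed, False

        removed.append(candidate)
        del kept[candidate]


# One extremely slow sequence ("a1") in lineage A.
rates = {"a1": 0.01, "a2": 1.0, "b1": 1.1, "b2": 1.2}
lineage_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

fast = subselect(rates, lineage_of, mode="fast_only")
both = subselect(rates, lineage_of, mode="both_tails")
print(fast[1], fast[2])
print(both[1], both[2])
```

In this toy dataset the fast-only mode discards a normal sequence and still fails to reach homogeneity, while the two-tailed mode removes only the slow outlier. This mirrors, in a very simplified way, the motivation the summary gives for LS(4).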