LS(X): automated reduction of gene-specific lineage evolutionary rate heterogeneity for multi-gene phylogeny inference

BACKGROUND: Lineage rate heterogeneity can be a major source of bias, especially in multi-gene phylogeny inference. We had previously tackled this issue by developing LS(3), a data subselection algorithm that, by removing fast-evolving sequences in a gene-specific manner, identifies subsets of seque...

Bibliographic Details
Main authors: Rivera-Rivera, Carlos J., Montoya-Burgos, Juan I.
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6693147/
https://www.ncbi.nlm.nih.gov/pubmed/31409290
http://dx.doi.org/10.1186/s12859-019-3020-1
author Rivera-Rivera, Carlos J.
Montoya-Burgos, Juan I.
collection PubMed
description BACKGROUND: Lineage rate heterogeneity can be a major source of bias, especially in multi-gene phylogeny inference. We previously tackled this issue by developing LS(3), a data subselection algorithm that, by removing fast-evolving sequences in a gene-specific manner, identifies subsets of sequences that evolve at a relatively homogeneous rate. However, this algorithm had two major shortcomings: (i) it was automated and published as a set of bash scripts, and hence was Linux-specific and not user-friendly, and (ii) it could result in very stringent sequence subselection when extremely slow-evolving sequences were present.
RESULTS: We address these shortcomings with a new, platform-independent program, LS(X), written in R, which includes a reprogrammed version of the original LS(3) algorithm and adds features that improve lineage rate calculations. In addition, we developed and included an alternative version of the algorithm, LS(4), which reduces lineage rate heterogeneity by detecting both sequences that evolve too fast and sequences that evolve too slowly, resulting in less stringent data subselection when extremely slow-evolving sequences are present. The efficiency of LS(X), and of LS(4) on datasets with extremely slow-evolving sequences, is demonstrated with simulated data and by the resolution of a contentious node in the catfish phylogeny that was affected by unusually high lineage rate heterogeneity in the dataset.
CONCLUSIONS: LS(X) is a new bioinformatic tool with accessible code, with which the effect of lineage rate heterogeneity can be explored in gene sequence datasets of virtually any size. In addition, the two modalities of the sequence subsampling algorithm included, LS(3) and LS(4), allow the user to optimize the amount of non-phylogenetic signal removed while keeping a maximum of phylogenetic signal.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-3020-1) contains supplementary material, which is available to authorized users.
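The contrast the abstract draws between the two modalities can be illustrated with a toy sketch. This is not the published LS(3) or LS(4) procedure: the abstract does not state the criteria those algorithms use, so everything below is an assumption made for illustration, including the function names, the use of a fastest-to-slowest rate ratio as the heterogeneity measure, the median as the reference rate, the `max_ratio` threshold, and the example rates. The sketch only shows why removing fast sequences alone can become very stringent when one extremely slow sequence is present, while removing outliers in both directions keeps more data.

```python
from statistics import median


def heterogeneity(rates):
    """Ratio of the fastest to the slowest rate (1.0 = perfectly homogeneous).

    Hypothetical measure chosen for this sketch only.
    """
    values = list(rates.values())
    return max(values) / min(values)


def subselect(rates, max_ratio=1.5, mode="fast_only"):
    """Drop sequences one at a time until the rates are homogeneous enough.

    rates     -- dict mapping sequence name to an (assumed) lineage rate
    max_ratio -- illustrative stopping threshold, not a published value
    mode      -- "fast_only": always drop the fastest sequence
                 "fast_and_slow": drop the sequence farthest from the median
    """
    kept = dict(rates)
    while len(kept) > 1 and heterogeneity(kept) > max_ratio:
        if mode == "fast_only":
            victim = max(kept, key=kept.get)
        elif mode == "fast_and_slow":
            centre = median(kept.values())
            victim = max(kept, key=lambda name: abs(kept[name] - centre))
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        del kept[victim]
    return kept


# Made-up rates for one gene: one very slow and one very fast sequence.
example_rates = {"slow": 0.1, "a": 1.0, "b": 1.1, "c": 1.2, "d": 1.3, "fast": 3.0}

print(sorted(subselect(example_rates, mode="fast_only")))
print(sorted(subselect(example_rates, mode="fast_and_slow")))
```

With these made-up rates, the fast-only mode keeps discarding sequences because the slow outlier holds the ratio high, whereas the two-sided mode discards only the two outliers.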
format Online
Article
Text
id pubmed-6693147
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6693147 2019-08-16 LS(X): automated reduction of gene-specific lineage evolutionary rate heterogeneity for multi-gene phylogeny inference Rivera-Rivera, Carlos J. Montoya-Burgos, Juan I. BMC Bioinformatics Software BioMed Central 2019-08-13 /pmc/articles/PMC6693147/ /pubmed/31409290 http://dx.doi.org/10.1186/s12859-019-3020-1 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title LS(X): automated reduction of gene-specific lineage evolutionary rate heterogeneity for multi-gene phylogeny inference
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6693147/
https://www.ncbi.nlm.nih.gov/pubmed/31409290
http://dx.doi.org/10.1186/s12859-019-3020-1