Cargando…

nRCFV: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets

MOTIVATION: Compositional heterogeneity—when the proportions of nucleotides and amino acids are not broadly similar across the dataset—is a cause of a great number of phylogenetic artefacts. Whilst a variety of methods can identify it post-hoc, few metrics exist to quantify compositional heterogenei...

Descripción completa

Detalles Bibliográficos
Autores principales: Fleming, James F., Struck, Torsten H.
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10099917/
https://www.ncbi.nlm.nih.gov/pubmed/37046225
http://dx.doi.org/10.1186/s12859-023-05270-8
_version_ 1785025161814081536
author Fleming, James F.
Struck, Torsten H.
author_facet Fleming, James F.
Struck, Torsten H.
author_sort Fleming, James F.
collection PubMed
description MOTIVATION: Compositional heterogeneity—when the proportions of nucleotides and amino acids are not broadly similar across the dataset—is a cause of a great number of phylogenetic artefacts. Whilst a variety of methods can identify it post-hoc, few metrics exist to quantify compositional heterogeneity prior to the computationally intensive task of phylogenetic tree reconstruction. Here we assess the efficacy of one such existing, widely used, metric: Relative Composition Frequency Variability (RCFV), using both real and simulated data. RESULTS: Our results show that RCFV can be biased by sequence length, the number of taxa, and the number of possible character states within the dataset. However, we also find that missing data does not appear to have an appreciable effect on RCFV. We discuss the theory behind this, the consequences of this for the future of the usage of the RCFV value and propose a new metric, nRCFV, which accounts for these biases. Alongside this, we present a new software that calculates both RCFV and nRCFV, called nRCFV_Reader. AVAILABILITY AND IMPLEMENTATION: nRCFV has been implemented in RCFV_Reader, available at: https://github.com/JFFleming/RCFV_Reader. Both our simulation and real data are available at Datadryad: https://doi.org/10.5061/dryad.wpzgmsbpn.
format Online
Article
Text
id pubmed-10099917
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-100999172023-04-14 nRCFV: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets Fleming, James F. Struck, Torsten H. BMC Bioinformatics Research MOTIVATION: Compositional heterogeneity—when the proportions of nucleotides and amino acids are not broadly similar across the dataset—is a cause of a great number of phylogenetic artefacts. Whilst a variety of methods can identify it post-hoc, few metrics exist to quantify compositional heterogeneity prior to the computationally intensive task of phylogenetic tree reconstruction. Here we assess the efficacy of one such existing, widely used, metric: Relative Composition Frequency Variability (RCFV), using both real and simulated data. RESULTS: Our results show that RCFV can be biased by sequence length, the number of taxa, and the number of possible character states within the dataset. However, we also find that missing data does not appear to have an appreciable effect on RCFV. We discuss the theory behind this, the consequences of this for the future of the usage of the RCFV value and propose a new metric, nRCFV, which accounts for these biases. Alongside this, we present a new software that calculates both RCFV and nRCFV, called nRCFV_Reader. AVAILABILITY AND IMPLEMENTATION: nRCFV has been implemented in RCFV_Reader, available at: https://github.com/JFFleming/RCFV_Reader. Both our simulation and real data are available at Datadryad: https://doi.org/10.5061/dryad.wpzgmsbpn. BioMed Central 2023-04-12 /pmc/articles/PMC10099917/ /pubmed/37046225 http://dx.doi.org/10.1186/s12859-023-05270-8 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Research
Fleming, James F.
Struck, Torsten H.
nRCFV: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets
title nRCFV: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets
title_full nRCFV: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets
title_fullStr nRCFV: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets
title_full_unstemmed nRCFV: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets
title_short nRCFV: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets
title_sort nrcfv: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10099917/
https://www.ncbi.nlm.nih.gov/pubmed/37046225
http://dx.doi.org/10.1186/s12859-023-05270-8
work_keys_str_mv AT flemingjamesf nrcfvanewdatasetsizeindependentmetrictoquantifycompositionalheterogeneityinnucleotideandaminoaciddatasets
AT strucktorstenh nrcfvanewdatasetsizeindependentmetrictoquantifycompositionalheterogeneityinnucleotideandaminoaciddatasets