PhylteR: Efficient Identification of Outlier Sequences in Phylogenomic Datasets

In phylogenomics, incongruences between gene trees, resulting from both artifactual and biological reasons, can decrease the signal-to-noise ratio and complicate species tree inference. The amount of data handled today in classical phylogenomic analyses precludes manual error detection and removal....

Bibliographic Details
Main authors: Comte, Aurore; Tricou, Théo; Tannier, Eric; Joseph, Julien; Siberchicot, Aurélie; Penel, Simon; Allio, Rémi; Delsuc, Frédéric; Dray, Stéphane; de Vienne, Damien M
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10655845/
https://www.ncbi.nlm.nih.gov/pubmed/37879113
http://dx.doi.org/10.1093/molbev/msad234
author Comte, Aurore
Tricou, Théo
Tannier, Eric
Joseph, Julien
Siberchicot, Aurélie
Penel, Simon
Allio, Rémi
Delsuc, Frédéric
Dray, Stéphane
de Vienne, Damien M
author_sort Comte, Aurore
collection PubMed
description In phylogenomics, incongruences between gene trees, resulting from both artifactual and biological reasons, can decrease the signal-to-noise ratio and complicate species tree inference. The amount of data handled today in classical phylogenomic analyses precludes manual error detection and removal. However, a simple and efficient way to automate the identification of outliers from a collection of gene trees is still missing. Here, we present PhylteR, a method that allows rapid and accurate detection of outlier sequences in phylogenomic datasets, i.e. species from individual gene trees that do not follow the general trend. PhylteR relies on DISTATIS, an extension of multidimensional scaling to 3 dimensions to compare multiple distance matrices at once. In PhylteR, these distance matrices extracted from individual gene phylogenies represent evolutionary distances between species according to each gene. On simulated datasets, we show that PhylteR identifies outliers with more sensitivity and precision than a comparable existing method. We also show that PhylteR is not sensitive to ILS-induced incongruences, which is a desirable feature. On a biological dataset of 14,463 genes for 53 species previously assembled for Carnivora phylogenomics, we show (i) that PhylteR identifies as outliers sequences that can be considered as such by other means, and (ii) that the removal of these sequences improves the concordance between the gene trees and the species tree. Thanks to the generation of numerous graphical outputs, PhylteR also allows for the rapid and easy visual characterization of the dataset at hand, thus aiding in the precise identification of errors. PhylteR is distributed as an R package on CRAN and as containerized versions (docker and singularity).
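The idea in the description above (one species-by-species distance matrix per gene, compared jointly to find species that depart from the general trend in individual genes) can be illustrated with a toy sketch. This is not PhylteR and does not implement DISTATIS: it swaps in an element-wise median as the consensus and a Q3 + k × IQR cutoff as the outlier rule, both of which are assumptions made for illustration. The function names, the toy distances, and the value of `k` are hypothetical, and the matrices are assumed to be on a comparable scale already.

```python
from statistics import median, quantiles


def consensus_matrix(matrices):
    """Element-wise median of equally sized square distance matrices."""
    n = len(matrices[0])
    return [[median(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]


def deviation_scores(species, gene_matrices):
    """Score each (gene, species) pair by how far its row lies from the consensus."""
    consensus = consensus_matrix(list(gene_matrices.values()))
    n = len(species)
    scores = {}
    for gene, matrix in gene_matrices.items():
        for i, name in enumerate(species):
            diffs = [abs(matrix[i][j] - consensus[i][j]) for j in range(n) if j != i]
            scores[(gene, name)] = sum(diffs) / len(diffs)
    return scores


def flag_outliers(species, gene_matrices, k=3.0):
    """Return (gene, species) pairs scoring above Q3 + k * IQR (illustrative rule)."""
    scores = deviation_scores(species, gene_matrices)
    q1, _, q3 = quantiles(scores.values(), n=4)
    threshold = q3 + k * (q3 - q1)
    return sorted(pair for pair, score in scores.items() if score > threshold)


# Toy data (invented for this sketch): four species, six genes.
SPECIES = ["A", "B", "C", "D"]
BASE = [
    [0.0, 2.0, 4.0, 6.0],
    [2.0, 0.0, 4.0, 6.0],
    [4.0, 4.0, 0.0, 6.0],
    [6.0, 6.0, 6.0, 0.0],
]


def shifted(offset):
    """Copy of BASE with `offset` added to every off-diagonal distance."""
    return [
        [0.0 if i == j else value + offset for j, value in enumerate(row)]
        for i, row in enumerate(BASE)
    ]


# Genes g1..g5 follow the common pattern with small uniform shifts.
GENE_MATRICES = {
    f"g{n}": shifted(offset)
    for n, offset in enumerate([0.0, 0.2, -0.2, 0.4, -0.4], start=1)
}

# Gene g6 places species D much closer to every other species.
misplaced = shifted(0.0)
for i in range(3):
    misplaced[i][3] = misplaced[3][i] = 3.0
GENE_MATRICES["g6"] = misplaced

flagged = flag_outliers(SPECIES, GENE_MATRICES)
```

On this toy input, `flagged` contains only the pair for species D in gene g6. The other species in g6 also deviate slightly, because their distance to D changed, but they stay below the cutoff. Per the description, the R package itself relies on DISTATIS, so this sketch should not be used in its place.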
format Online
Article
Text
id pubmed-10655845
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10655845 2023-10-25
Mol Biol Evol
Methods
Oxford University Press 2023-10-25
/pmc/articles/PMC10655845/
/pubmed/37879113
http://dx.doi.org/10.1093/molbev/msad234
Text en
© The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title PhylteR: Efficient Identification of Outlier Sequences in Phylogenomic Datasets
title_sort phylter: efficient identification of outlier sequences in phylogenomic datasets
topic Methods
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10655845/
https://www.ncbi.nlm.nih.gov/pubmed/37879113
http://dx.doi.org/10.1093/molbev/msad234