PhyloMissForest: a random forest framework to construct phylogenetic trees with missing data

Bibliographic Details
Main Authors:	Pinheiro, Diogo, Santander-Jimenéz, Sergio, Ilic, Aleksandar
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9116704/ https://www.ncbi.nlm.nih.gov/pubmed/35585494 http://dx.doi.org/10.1186/s12864-022-08540-6

_version_	1784710167407886336
author	Pinheiro, Diogo Santander-Jimenéz, Sergio Ilic, Aleksandar
author_facet	Pinheiro, Diogo Santander-Jimenéz, Sergio Ilic, Aleksandar
author_sort	Pinheiro, Diogo
collection	PubMed
description	BACKGROUND: In the pursuit of a better understanding of biodiversity, evolutionary biologists rely on the study of phylogenetic relationships to illustrate the course of evolution. The relationships among natural organisms, depicted in the shape of phylogenetic trees, not only help to understand evolutionary history but also have a wide range of additional applications in science. One of the most challenging problems that arise when building phylogenetic trees is the presence of missing biological data. More specifically, the possibility of inferring wrong phylogenetic trees increases proportionally to the amount of missing values in the input data. Although there are methods proposed to deal with this issue, their applicability and accuracy are often restricted by different constraints. RESULTS: We propose a framework, called PhyloMissForest, to impute missing entries in phylogenetic distance matrices and infer accurate evolutionary relationships. PhyloMissForest is built upon a random forest structure that infers the missing entries of the input data from its known entries. PhyloMissForest contributes a robust and configurable framework that incorporates multiple search strategies and machine learning, complemented by phylogenetic techniques, to provide a more accurate inference of lost phylogenetic distances. We evaluate our framework by examining three real-world datasets, two DNA-based sequence alignments and one containing amino acid data, and two additional instances with simulated DNA data. Moreover, we follow a design of experiments methodology to define the hyperparameter values of our algorithm, a concise method that is preferable to the well-known exhaustive parameter search. By varying the percentages of missing data from 5% to 60%, we generally outperform the state-of-the-art alternative imputation techniques in the tests conducted on real DNA data. In addition, significant improvements in execution time are observed for the amino acid instance. The results observed on simulated data also show improved imputations when dealing with large percentages of missing data. CONCLUSIONS: By merging multiple search strategies, machine learning, and phylogenetic techniques, PhyloMissForest provides a highly customizable and robust framework for phylogenetic missing data imputation, with significant topological accuracy and effective speedups over the state of the art. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-022-08540-6.
format	Online Article Text
id	pubmed-9116704
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-9116704 2022-05-19 PhyloMissForest: a random forest framework to construct phylogenetic trees with missing data Pinheiro, Diogo Santander-Jimenéz, Sergio Ilic, Aleksandar BMC Genomics Research BACKGROUND: In the pursuit of a better understanding of biodiversity, evolutionary biologists rely on the study of phylogenetic relationships to illustrate the course of evolution. The relationships among natural organisms, depicted in the shape of phylogenetic trees, not only help to understand evolutionary history but also have a wide range of additional applications in science. One of the most challenging problems that arise when building phylogenetic trees is the presence of missing biological data. More specifically, the possibility of inferring wrong phylogenetic trees increases proportionally to the amount of missing values in the input data. Although there are methods proposed to deal with this issue, their applicability and accuracy are often restricted by different constraints. RESULTS: We propose a framework, called PhyloMissForest, to impute missing entries in phylogenetic distance matrices and infer accurate evolutionary relationships. PhyloMissForest is built upon a random forest structure that infers the missing entries of the input data from its known entries. PhyloMissForest contributes a robust and configurable framework that incorporates multiple search strategies and machine learning, complemented by phylogenetic techniques, to provide a more accurate inference of lost phylogenetic distances. We evaluate our framework by examining three real-world datasets, two DNA-based sequence alignments and one containing amino acid data, and two additional instances with simulated DNA data. Moreover, we follow a design of experiments methodology to define the hyperparameter values of our algorithm, a concise method that is preferable to the well-known exhaustive parameter search. By varying the percentages of missing data from 5% to 60%, we generally outperform the state-of-the-art alternative imputation techniques in the tests conducted on real DNA data. In addition, significant improvements in execution time are observed for the amino acid instance. The results observed on simulated data also show improved imputations when dealing with large percentages of missing data. CONCLUSIONS: By merging multiple search strategies, machine learning, and phylogenetic techniques, PhyloMissForest provides a highly customizable and robust framework for phylogenetic missing data imputation, with significant topological accuracy and effective speedups over the state of the art. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-022-08540-6. BioMed Central 2022-05-18 /pmc/articles/PMC9116704/ /pubmed/35585494 http://dx.doi.org/10.1186/s12864-022-08540-6 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Research Pinheiro, Diogo Santander-Jimenéz, Sergio Ilic, Aleksandar PhyloMissForest: a random forest framework to construct phylogenetic trees with missing data
title	PhyloMissForest: a random forest framework to construct phylogenetic trees with missing data
title_sort	phylomissforest: a random forest framework to construct phylogenetic trees with missing data
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9116704/ https://www.ncbi.nlm.nih.gov/pubmed/35585494 http://dx.doi.org/10.1186/s12864-022-08540-6
work_keys_str_mv	AT pinheirodiogo phylomissforestarandomforestframeworktoconstructphylogenetictreeswithmissingdata AT santanderjimenezsergio phylomissforestarandomforestframeworktoconstructphylogenetictreeswithmissingdata AT ilicaleksandar phylomissforestarandomforestframeworktoconstructphylogenetictreeswithmissingdata
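
The description in this record reports evaluating imputation with 5% to 60% of the entries of a phylogenetic distance matrix missing. As a rough illustration of what such an input might look like, the sketch below hides a chosen fraction of the pairwise distances in a symmetric matrix. It is not taken from the article: the helper name `mask_distance_matrix`, the use of `None` as the missing-value marker, the uniform random choice of pairs, and the toy values are all assumptions made for this example.

```python
import random


def mask_distance_matrix(dist, missing_fraction, seed=0):
    """Return a copy of a symmetric distance matrix with a fraction of its
    off-diagonal pairs set to None.

    Hypothetical helper for illustration; the article's own masking
    procedure is not described in this record.
    """
    n = len(dist)
    # Each unordered pair (i, j) with i < j is one pairwise distance.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    n_missing = round(missing_fraction * len(pairs))
    rng = random.Random(seed)
    masked = [row[:] for row in dist]  # copy rows; the input is left untouched
    for i, j in rng.sample(pairs, n_missing):
        # Hide both mirrored cells so the matrix stays symmetric.
        masked[i][j] = None
        masked[j][i] = None
    return masked


# Toy 4-taxon matrix (made-up values, for illustration only).
toy = [
    [0.0, 0.3, 0.5, 0.6],
    [0.3, 0.0, 0.4, 0.7],
    [0.5, 0.4, 0.0, 0.2],
    [0.6, 0.7, 0.2, 0.0],
]
masked = mask_distance_matrix(toy, missing_fraction=0.5, seed=1)
```

A matrix masked this way could then be passed to any imputation method; the entries left in place are the "known parts" from which the missing distances would be inferred.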