
Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices

BACKGROUND: With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large scale molecular data. One of the...


Bibliographic Details
Main authors: Bhattacharjee, Ananya; Bayzid, Md. Shamsuzzoha
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7370488/
https://www.ncbi.nlm.nih.gov/pubmed/32689946
http://dx.doi.org/10.1186/s12864-020-06892-5
_version_ 1783560987555659776
author Bhattacharjee, Ananya
Bayzid, Md. Shamsuzzoha
author_facet Bhattacharjee, Ananya
Bayzid, Md. Shamsuzzoha
author_sort Bhattacharjee, Ananya
collection PubMed
description BACKGROUND: With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean) require complete distance matrices without any missing data. RESULTS: We introduce two highly accurate machine learning based distance imputation techniques. These methods are based on matrix factorization and autoencoder based deep learning architectures. We evaluated these two methods on a collection of simulated and biological datasets. Experimental results suggest that our proposed methods match or improve upon the best alternate distance imputation techniques. Moreover, these methods are scalable to large datasets with hundreds of taxa, and can handle a substantial amount of missing data. CONCLUSIONS: This study shows, for the first time, the power and feasibility of applying deep learning techniques for imputing distance matrices. Thus, this study advances the state-of-the-art in phylogenetic tree construction in the presence of missing data. The proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances.
format Online
Article
Text
id pubmed-7370488
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7370488 2020-07-21 Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices Bhattacharjee, Ananya Bayzid, Md. Shamsuzzoha BMC Genomics Methodology Article
BioMed Central 2020-07-20 /pmc/articles/PMC7370488/ /pubmed/32689946 http://dx.doi.org/10.1186/s12864-020-06892-5 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Methodology Article
Bhattacharjee, Ananya
Bayzid, Md. Shamsuzzoha
Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices
title Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices
title_full Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices
title_fullStr Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices
title_full_unstemmed Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices
title_short Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices
title_sort machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7370488/
https://www.ncbi.nlm.nih.gov/pubmed/32689946
http://dx.doi.org/10.1186/s12864-020-06892-5
work_keys_str_mv AT bhattacharjeeananya machinelearningbasedimputationtechniquesforestimatingphylogenetictreesfromincompletedistancematrices
AT bayzidmdshamsuzzoha machinelearningbasedimputationtechniquesforestimatingphylogenetictreesfromincompletedistancematrices
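
Illustrative note: the description field above says the authors impute missing distances with matrix factorization so that methods such as NJ and UPGMA, which need a complete matrix, can be applied. The following is a minimal sketch of that general idea only, not the authors' implementation (see their repository for that). The function name impute_distances, its parameters (rank, epochs, lr, reg, seed), and the use of None to mark a missing entry are all assumptions made for this example. It fits a low-rank model to the observed entries by stochastic gradient descent and fills each missing pair with the symmetrized prediction.

```python
import random


def impute_distances(dist, rank=2, epochs=2000, lr=0.01, reg=0.001, seed=0):
    """Fill missing (None) entries of a symmetric distance matrix.

    Hypothetical helper for illustration: models D[i][j] ~ U[i] . V[j],
    trained only on observed off-diagonal entries. Observed entries are
    copied unchanged; the diagonal is set to 0.
    """
    n = len(dist)
    rng = random.Random(seed)
    observed = [(i, j, dist[i][j]) for i in range(n) for j in range(n)
                if i != j and dist[i][j] is not None]
    # Rescale distances to [0, 1] so one learning rate works across datasets.
    scale = max(d for _, _, d in observed) or 1.0
    U = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n)]

    for _ in range(epochs):
        rng.shuffle(observed)
        for i, j, d in observed:
            pred = sum(U[i][k] * V[j][k] for k in range(rank))
            err = d / scale - pred
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Gradient step on squared error with L2 regularization.
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)

    filled = [row[:] for row in dist]
    for i in range(n):
        filled[i][i] = 0.0
        for j in range(i + 1, n):
            known = dist[i][j] if dist[i][j] is not None else dist[j][i]
            if known is None:
                a = sum(U[i][k] * V[j][k] for k in range(rank))
                b = sum(U[j][k] * V[i][k] for k in range(rank))
                # Average both directions to keep symmetry; clamp at zero.
                known = max(0.0, scale * (a + b) / 2)
            filled[i][j] = filled[j][i] = known
    return filled
```

The returned matrix is complete and symmetric, so it could be passed to any distance-based tree builder. The rank and training settings here are arbitrary defaults and would need tuning on real data; the autoencoder variant mentioned in the description is not sketched.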