Phylogenetic double placement of mixed samples

Bibliographic Details
Main authors:	Balaban, Metin; Mirarab, Siavash
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Population Genomics and Molecular Evolution
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355250/ https://www.ncbi.nlm.nih.gov/pubmed/32657414 http://dx.doi.org/10.1093/bioinformatics/btaa489

author	Balaban, Metin; Mirarab, Siavash
collection	PubMed
description	MOTIVATION: Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has enjoyed surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes at low coverage, precluding assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction. RESULTS: We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between each sample represented as k-mer sets. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that decomposes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. We test the resulting method, called MIxed Sample Analysis tool (MISA), on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice. AVAILABILITY AND IMPLEMENTATION: The software and data are available at https://github.com/balabanmetin/misa and https://github.com/balabanmetin/misa-data. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-7355250
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7355250 2020-07-16 Phylogenetic double placement of mixed samples. Balaban, Metin; Mirarab, Siavash. Bioinformatics. Population Genomics and Molecular Evolution. Oxford University Press 2020-07 2020-07-13 /pmc/articles/PMC7355250/ /pubmed/32657414 http://dx.doi.org/10.1093/bioinformatics/btaa489 Text en © The Author(s) 2020. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	Phylogenetic double placement of mixed samples
topic	Population Genomics and Molecular Evolution
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355250/ https://www.ncbi.nlm.nih.gov/pubmed/32657414 http://dx.doi.org/10.1093/bioinformatics/btaa489
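
The description above says the model is based on Jaccard indices computed between samples represented as k-mer sets. As a rough illustration of that quantity only, the following toy sketch computes a Jaccard index between a reference and a two-organism mixture. It is not the MISA implementation: the function names `kmers` and `jaccard`, the toy sequences, the value of `k`, and the choice to treat the mixture as the plain union of its constituents' k-mer sets are all assumptions made here for the example.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy sequences invented for illustration; real inputs would be read sets.
k = 4
reference = kmers("ACGTACGTGA", k)
organism_1 = kmers("ACGTACGTGA", k)  # identical to the reference
organism_2 = kmers("TTTTCCCCGG", k)  # shares no 4-mers with the reference

# Assumption: the mixed sample's k-mer set is the union of both constituents.
mixed = organism_1 | organism_2

j_single = jaccard(reference, organism_1)
j_mixed = jaccard(reference, mixed)
print(j_single, j_mixed)
```

In this toy case the reference matches one constituent exactly, yet its Jaccard index against the mixture is well below 1 because the second constituent inflates the union. That gap is the kind of effect a model relating mixture distances to constituent distances has to account for.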
