
Bacterial whole genome-based phylogeny: construction of a new benchmarking dataset and assessment of some existing methods

BACKGROUND: Whole genome sequencing (WGS) is increasingly used in diagnostics and surveillance of infectious diseases. A major application for WGS is to use the data for identifying outbreak clusters, and there is therefore a need for methods that can accurately and efficiently infer phylogenies fro...


Bibliographic Details
Main Authors:	Ahrenfeldt, Johanne; Skaarup, Carina; Hasman, Henrik; Pedersen, Anders Gorm; Aarestrup, Frank Møller; Lund, Ole
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5217230/ https://www.ncbi.nlm.nih.gov/pubmed/28056767 http://dx.doi.org/10.1186/s12864-016-3407-6

author	Ahrenfeldt, Johanne; Skaarup, Carina; Hasman, Henrik; Pedersen, Anders Gorm; Aarestrup, Frank Møller; Lund, Ole
author_sort	Ahrenfeldt, Johanne
collection	PubMed
description	BACKGROUND: Whole genome sequencing (WGS) is increasingly used in diagnostics and surveillance of infectious diseases. A major application of WGS is identifying outbreak clusters, so there is a need for methods that can accurately and efficiently infer phylogenies from sequencing reads. In the present study we describe a new dataset that we created for benchmarking such WGS-based methods on epidemiological data, and we present an analysis in which we use the data to compare the performance of some current methods. RESULTS: Our aim was to create a benchmark dataset that mimics sequencing data of the sort that might be collected during an outbreak of an infectious disease. This was achieved by letting an E. coli hypermutator strain grow in the lab for 8 consecutive days, each day splitting the culture in two while also collecting samples for sequencing. The result is a dataset consisting of 101 whole genome sequences with known phylogenetic relationships. Among the sequenced samples, 51 correspond to internal nodes in the phylogeny because they are ancestral, while the remaining 50 correspond to leaves. We also used the newly created dataset to compare three methods, all available online, that infer phylogenies from whole-genome sequencing reads: NDtree, CSI Phylogeny and REALPHY. One complication when comparing the output of these methods with the known phylogeny is that phylogenetic methods typically build trees where all observed sequences are placed as leaves, even though some of them are in fact ancestral. We therefore devised a method for post-processing the inferred trees by collapsing short branches (thus relocating some leaves to internal nodes), and we also present two new measures of tree similarity that take into account the identity of both internal and leaf nodes. CONCLUSIONS: Based on this analysis we find that, among the investigated methods, CSI Phylogeny had the best performance, correctly identifying 73% of all branches in the tree and 71% of all clades. We have made all data from this experiment (raw sequencing reads, consensus whole-genome sequences, and descriptions of the known phylogeny in a variety of formats) publicly available, in the hope that other groups may find them useful for benchmarking and exploring the performance of epidemiological methods. All data are freely available at: https://cge.cbs.dtu.dk/services/evolution_data.php. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-016-3407-6) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5217230
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5217230 2017-01-09 BMC Genomics Research Article BioMed Central 2017-01-05 /pmc/articles/PMC5217230/ /pubmed/28056767 http://dx.doi.org/10.1186/s12864-016-3407-6 Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Bacterial whole genome-based phylogeny: construction of a new benchmarking dataset and assessment of some existing methods
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5217230/ https://www.ncbi.nlm.nih.gov/pubmed/28056767 http://dx.doi.org/10.1186/s12864-016-3407-6
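
The abstract in this record mentions post-processing inferred trees by collapsing short branches, so that samples placed as leaves can be relocated to internal nodes. The record gives no details of the authors' procedure, so the following is only a minimal sketch of the general idea under stated assumptions: the `Node` class, the `collapse_short_leaves` function, the `threshold` parameter and the sample names are all hypothetical and are not taken from the article.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; not the data structure used in the article."""
    label: Optional[str] = None      # sample name, or None if unlabelled
    length: float = 0.0              # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)


def collapse_short_leaves(node: Node, threshold: float) -> Node:
    """Relocate short-branch leaves onto their unlabelled parent node.

    Assumed rule (an illustration, not the published method): a leaf whose
    branch length is at most `threshold` is removed and its label is given
    to the parent, provided the parent does not already carry a label.
    """
    kept = []
    for child in node.children:
        is_leaf = not child.children
        if is_leaf and child.length <= threshold and node.label is None:
            node.label = child.label           # sample now sits at the internal node
        else:
            collapse_short_leaves(child, threshold)
            kept.append(child)
    node.children = kept
    return node


# Made-up example: two samples sit on zero-length branches and are
# therefore treated as ancestral after collapsing.
tree = Node(children=[
    Node("day0", 0.0),
    Node(length=3.0, children=[
        Node("day1", 0.0),
        Node("day2_a", 2.0),
        Node("day2_b", 4.0),
    ]),
])
collapse_short_leaves(tree, threshold=0.5)
```

After the call, the hypothetical samples `day0` and `day1` label internal nodes, while `day2_a` and `day2_b` remain leaves. How the threshold should be chosen in practice is not stated in this record.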

