
An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella

Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may hav...


Bibliographic Details
Main Authors: Pettengill, James B., Luo, Yan, Davis, Steven, Chen, Yi, Gonzalez-Escalona, Narjol, Ottesen, Andrea, Rand, Hugh, Allard, Marc W., Strain, Errol
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2014
Subjects: Bioinformatics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4201946/
https://www.ncbi.nlm.nih.gov/pubmed/25332847
http://dx.doi.org/10.7717/peerj.620
author Pettengill, James B.
Luo, Yan
Davis, Steven
Chen, Yi
Gonzalez-Escalona, Narjol
Ottesen, Andrea
Rand, Hugh
Allard, Marc W.
Strain, Errol
author_sort Pettengill, James B.
collection PubMed
description Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may have, we have evaluated the robustness of clustering individuals based on WGS data to three key factors: (1) next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism) matrix (reference-based and reference-free), and (3) phylogenetic inference method (FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole genome sequences representing 107 unique Salmonella enterica subsp. enterica ser. Montevideo strains. Reference-based approaches for identifying SNPs produced trees that were significantly more similar to one another than those produced under the reference-free approach. Topologies inferred using a core matrix (i.e., no missing data) were significantly more discordant than those inferred using a non-core matrix that allows for some missing data. However, allowing for too much missing data likely results in a high false discovery rate of SNPs. When analyzing the same SNP matrix, we observed that the more thorough inference methods implemented in GARLI and RAxML produced more similar topologies than FastTreeMP. Our results also confirm that reproducibility varies among NGS platforms where the MiSeq had the lowest number of pairwise differences among replicate runs. Our investigation into the robustness of clustering patterns illustrates the importance of carefully considering how data from different platforms are combined and analyzed. We found clear differences in the topologies inferred, and certain methods performed significantly better than others for discriminating between the highly clonal organisms investigated here. The methods supported by our results represent a preliminary set of guidelines and a step towards developing validated standards for clustering based on whole genome sequence data.
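The distinction the abstract draws between a core SNP matrix (no missing data) and a non-core matrix (some missing data allowed) can be illustrated with a small sketch. This is not the authors' pipeline: the function names, the set of missing-data symbols, the 0.34 threshold, and the toy sequences are all assumptions made for illustration. The sketch filters matrix columns by their fraction of missing calls and then counts pairwise differences between isolates.

```python
from itertools import combinations

# Symbols treated as missing calls (an assumption; real matrices may differ).
MISSING = {"-", "N", "?"}


def filter_sites(matrix, max_missing_fraction=0.0):
    """Keep only columns whose fraction of missing calls is within the threshold.

    A threshold of 0.0 yields a "core" matrix; a larger threshold yields a
    "non-core" matrix that tolerates some missing data.
    """
    names = list(matrix)
    n_samples = len(names)
    n_sites = len(matrix[names[0]])
    keep = []
    for col in range(n_sites):
        missing = sum(1 for name in names if matrix[name][col] in MISSING)
        if missing / n_samples <= max_missing_fraction:
            keep.append(col)
    return {name: "".join(matrix[name][c] for c in keep) for name in names}


def pairwise_differences(matrix):
    """Count differing sites per pair, skipping sites missing in either sample."""
    result = {}
    for a, b in combinations(sorted(matrix), 2):
        result[(a, b)] = sum(
            1
            for x, y in zip(matrix[a], matrix[b])
            if x not in MISSING and y not in MISSING and x != y
        )
    return result


# Hypothetical toy matrix: three isolates, five sites.
toy = {
    "isolate_A": "ACGTA",
    "isolate_B": "ATGTC",
    "isolate_C": "A-GNC",
}

core = filter_sites(toy, max_missing_fraction=0.0)       # drops sites 2 and 4
non_core = filter_sites(toy, max_missing_fraction=0.34)  # keeps all five sites

print(pairwise_differences(core))
print(pairwise_differences(non_core))
```

In this toy case the core matrix discards the site that separates isolate_A from isolate_B at the second position, so the non-core matrix reports one more difference for that pair. Whether such extra sites are true SNPs or false discoveries is the trade-off the abstract describes.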
format Online
Article
Text
id pubmed-4201946
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-4201946 2014-10-20 An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella Pettengill, James B. Luo, Yan Davis, Steven Chen, Yi Gonzalez-Escalona, Narjol Ottesen, Andrea Rand, Hugh Allard, Marc W. Strain, Errol PeerJ Bioinformatics PeerJ Inc. 2014-10-14 /pmc/articles/PMC4201946/ /pubmed/25332847 http://dx.doi.org/10.7717/peerj.620 Text en © 2014 Pettengill et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4201946/
https://www.ncbi.nlm.nih.gov/pubmed/25332847
http://dx.doi.org/10.7717/peerj.620