An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella
Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may hav...
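Main authors: | Pettengill, James B.; Luo, Yan; Davis, Steven; Chen, Yi; Gonzalez-Escalona, Narjol; Ottesen, Andrea; Rand, Hugh; Allard, Marc W.; Strain, Errol |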
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2014 |
Subjects: | Bioinformatics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4201946/ https://www.ncbi.nlm.nih.gov/pubmed/25332847 http://dx.doi.org/10.7717/peerj.620 |
_version_ | 1782340246113878016 |
---|---|
author | Pettengill, James B.; Luo, Yan; Davis, Steven; Chen, Yi; Gonzalez-Escalona, Narjol; Ottesen, Andrea; Rand, Hugh; Allard, Marc W.; Strain, Errol |
author_sort | Pettengill, James B. |
collection | PubMed |
description | Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may have, we have evaluated the robustness of clustering individuals based on WGS data to three key factors: (1) next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism) matrix (reference-based and reference-free), and (3) phylogenetic inference method (FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole genome sequences representing 107 unique Salmonella enterica subsp. enterica ser. Montevideo strains. Reference-based approaches for identifying SNPs produced trees that were significantly more similar to one another than those produced under the reference-free approach. Topologies inferred using a core matrix (i.e., no missing data) were significantly more discordant than those inferred using a non-core matrix that allows for some missing data. However, allowing for too much missing data likely results in a high false discovery rate of SNPs. When analyzing the same SNP matrix, we observed that the more thorough inference methods implemented in GARLI and RAxML produced more similar topologies than FastTreeMP. Our results also confirm that reproducibility varies among NGS platforms where the MiSeq had the lowest number of pairwise differences among replicate runs. Our investigation into the robustness of clustering patterns illustrates the importance of carefully considering how data from different platforms are combined and analyzed. We found clear differences in the topologies inferred, and certain methods performed significantly better than others for discriminating between the highly clonal organisms investigated here. The methods supported by our results represent a preliminary set of guidelines and a step towards developing validated standards for clustering based on whole genome sequence data. |
format | Online Article Text |
id | pubmed-4201946 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-4201946 2014-10-20 PeerJ Bioinformatics PeerJ Inc. 2014-10-14 /pmc/articles/PMC4201946/ /pubmed/25332847 http://dx.doi.org/10.7717/peerj.620 Text en © 2014 Pettengill et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited. |
title | An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4201946/ https://www.ncbi.nlm.nih.gov/pubmed/25332847 http://dx.doi.org/10.7717/peerj.620 |
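
The description above contrasts a core SNP matrix (no missing data) with a non-core matrix that tolerates some missing data. The sketch below is only an illustration of that distinction on a toy alignment, not the authors' pipeline: the function name `filter_snp_columns`, the set of missing-data symbols, and the `max_missing_fraction` threshold are all assumptions introduced here.

```python
from typing import Dict

# Assumed symbols for a missing or uncalled base (hypothetical convention).
MISSING = {"-", "N", "?"}


def filter_snp_columns(alignment: Dict[str, str],
                       max_missing_fraction: float = 0.0) -> Dict[str, str]:
    """Keep variable columns whose missing-data fraction is within the limit.

    max_missing_fraction = 0.0 yields a "core" matrix (no missing data);
    a larger value yields a "non-core" matrix.
    """
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all sequences must have the same length")

    keep = []
    for i in range(length):
        column = [alignment[n][i].upper() for n in names]
        missing = sum(base in MISSING for base in column)
        observed = {base for base in column if base not in MISSING}
        # A SNP column needs at least two distinct observed bases.
        if len(observed) < 2:
            continue
        if missing / len(names) <= max_missing_fraction:
            keep.append(i)

    return {n: "".join(alignment[n][i] for i in keep) for n in names}


# Toy alignment: four samples, five aligned positions.
toy = {
    "s1": "ACGTA",
    "s2": "ACGCA",
    "s3": "A-GCT",
    "s4": "ATGCN",
}
core = filter_snp_columns(toy, max_missing_fraction=0.0)
non_core = filter_snp_columns(toy, max_missing_fraction=0.25)
```

In this toy case the core matrix keeps only the one variable column with a call in every sample, while the non-core matrix also keeps the two variable columns in which a single sample is missing. Raising the threshold admits more columns, which is the trade-off the description flags when it warns that too much missing data likely inflates false SNP calls.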