Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance

Bibliographic Details
Main authors: Timme, Ruth E.; Rand, Hugh; Shumway, Martin; Trees, Eija K.; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E.; Defibaugh-Chavez, Stephanie; Carleton, Heather A.; Klimke, William A.; Katz, Lee S.
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2017
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5782805/
https://www.ncbi.nlm.nih.gov/pubmed/29372115
http://dx.doi.org/10.7717/peerj.3893
author Timme, Ruth E.
Rand, Hugh
Shumway, Martin
Trees, Eija K.
Simmons, Mustafa
Agarwala, Richa
Davis, Steven
Tillman, Glenn E.
Defibaugh-Chavez, Stephanie
Carleton, Heather A.
Klimke, William A.
Katz, Lee S.
collection PubMed
description BACKGROUND: As next-generation sequencing technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationships in public health. Most new programs skip the traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing the results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines.
METHODS: We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and “known” phylogenetic trees in publicly accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format.
RESULTS: Our “outbreak” benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the “known tree” can be accurately called the “true tree”. The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets.
DISCUSSION: These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools; we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.
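The abstract describes pipelines that summarize variants in a matrix of single-nucleotide polymorphisms before phylogenetic analysis. As a rough illustration of what such a matrix supports, the Python sketch below counts pairwise SNP differences between isolates. It is not the authors' method or any specific pipeline: the function name, the isolate names, the toy sequences, and the choice to skip `N` and `-` sites are all assumptions made for this example.

```python
from itertools import combinations


def pairwise_snp_distances(alignment):
    """Count differing sites between every pair of aligned sequences.

    `alignment` maps an isolate name to its row of the SNP matrix
    (one character per variant site). Sites where either isolate has
    an ambiguous call ("N") or a gap ("-") are skipped; that rule is
    an assumption of this sketch, not a documented standard.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all sequences must have the same length")

    ambiguous = {"N", "-"}
    distances = {}
    for a, b in combinations(sorted(alignment), 2):
        pairs = zip(alignment[a].upper(), alignment[b].upper())
        distances[(a, b)] = sum(
            1
            for x, y in pairs
            if x != y and x not in ambiguous and y not in ambiguous
        )
    return distances


# Hypothetical toy SNP matrix: three isolates, six variant sites.
snp_matrix = {
    "isolate_A": "ACGTAC",
    "isolate_B": "ACGTAT",
    "isolate_C": "TCGNAT",
}

for pair, dist in pairwise_snp_distances(snp_matrix).items():
    print(pair, dist)
```

With a benchmark dataset, distances like these could be computed from a pipeline's own SNP matrix and then compared against the relationships implied by the dataset's “known” tree.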
format Online
Article
Text
id pubmed-5782805
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-5782805 2018-01-25 PeerJ Bioinformatics PeerJ Inc. 2017-10-06 /pmc/articles/PMC5782805/ /pubmed/29372115 http://dx.doi.org/10.7717/peerj.3893 Text en http://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, made available under the Creative Commons Public Domain Dedication (http://creativecommons.org/publicdomain/zero/1.0/). This work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.
title Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance
topic Bioinformatics