Comprehensive assessment of the quality of Salmonella whole genome sequence data available in public sequence databases using the Salmonella in silico Typing Resource (SISTR)

Bibliographic Details
Main authors: Robertson, James; Yoshida, Catherine; Kruczkiewicz, Peter; Nadon, Celine; Nichani, Anil; Taboada, Eduardo N.; Nash, John Howard Eagles
Format: Online Article Text
Language: English
Published: Microbiology Society 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5857378/
https://www.ncbi.nlm.nih.gov/pubmed/29338812
http://dx.doi.org/10.1099/mgen.0.000151
author Robertson, James
Yoshida, Catherine
Kruczkiewicz, Peter
Nadon, Celine
Nichani, Anil
Taboada, Eduardo N.
Nash, John Howard Eagles
collection PubMed
description Public health and food safety institutions around the world are adopting whole genome sequencing (WGS) to replace conventional methods for characterizing Salmonella for use in surveillance and outbreak response. Falling costs and increased throughput of WGS have resulted in an explosion of data, but questions remain as to the reliability and robustness of the data. Due to the critical importance of serovar information to public health, it is essential to have reliable serovar assignments available for all of the Salmonella records. The current study used a systematic assessment and curation of all Salmonella in the Sequence Read Archive (SRA) to assess the state of the data and their utility. A total of 67 758 genomes were assembled de novo and quality-assessed for their assembly metrics as well as species and serovar assignments. A total of 42 400 genomes passed all of the quality criteria, but 30.16 % of genomes were deposited without serotype information. These data were used to compare the concordance of reported and predicted serovars for two in silico prediction tools, multi-locus sequence typing (MLST) and the Salmonella in silico Typing Resource (SISTR), which produced predictions that were fully concordant with 87.51 and 91.91 % of the tested isolates, respectively. When serovar variants were grouped together, concordance of the in silico predictions increased to 89.25 % for MLST and 94.98 % for SISTR. This study represents the first large-scale validation of serovar information in public genomes and provides a large validated set of genomes, which can be used to benchmark new bioinformatics tools.
format Online
Article
Text
id pubmed-5857378
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Microbiology Society
record_format MEDLINE/PubMed
spelling pubmed-5857378 2018-04-05
Microb Genom
Research Article
Microbiology Society 2018-01-17
/pmc/articles/PMC5857378/
/pubmed/29338812
http://dx.doi.org/10.1099/mgen.0.000151
Text en
© 2018. This is an open access article under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.
title Comprehensive assessment of the quality of Salmonella whole genome sequence data available in public sequence databases using the Salmonella in silico Typing Resource (SISTR)
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5857378/
https://www.ncbi.nlm.nih.gov/pubmed/29338812
http://dx.doi.org/10.1099/mgen.0.000151