Benchmark datasets for SARS-CoV-2 surveillance bioinformatics

BACKGROUND: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), has spread globally and is being surveilled with an international genome sequencing effort. Surveillance consists of sample acquisition, library preparation, and whole genome s...

Bibliographic Details
Main authors: Xiaoli, Lingzi, Hagey, Jill V., Park, Daniel J., Gulvik, Christopher A., Young, Erin L., Alikhan, Nabil-Fareed, Lawsin, Adrian, Hassell, Norman, Knipe, Kristen, Oakeson, Kelly F., Retchless, Adam C., Shakya, Migun, Lo, Chien-Chi, Chain, Patrick, Page, Andrew J., Metcalf, Benjamin J., Su, Michelle, Rowell, Jessica, Vidyaprakash, Eshaw, Paden, Clinton R., Huang, Andrew D., Roellig, Dawn, Patel, Ketan, Winglee, Kathryn, Weigand, Michael R., Katz, Lee S.
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9454940/
https://www.ncbi.nlm.nih.gov/pubmed/36093336
http://dx.doi.org/10.7717/peerj.13821
author Xiaoli, Lingzi
Hagey, Jill V.
Park, Daniel J.
Gulvik, Christopher A.
Young, Erin L.
Alikhan, Nabil-Fareed
Lawsin, Adrian
Hassell, Norman
Knipe, Kristen
Oakeson, Kelly F.
Retchless, Adam C.
Shakya, Migun
Lo, Chien-Chi
Chain, Patrick
Page, Andrew J.
Metcalf, Benjamin J.
Su, Michelle
Rowell, Jessica
Vidyaprakash, Eshaw
Paden, Clinton R.
Huang, Andrew D.
Roellig, Dawn
Patel, Ketan
Winglee, Kathryn
Weigand, Michael R.
Katz, Lee S.
author_sort Xiaoli, Lingzi
collection PubMed
description BACKGROUND: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), has spread globally and is being surveilled with an international genome sequencing effort. Surveillance consists of sample acquisition, library preparation, and whole genome sequencing. This has necessitated a classification scheme detailing Variants of Concern (VOC) and Variants of Interest (VOI), and the rapid expansion of bioinformatics tools for sequence analysis. These bioinformatic tools are means for major actionable results: maintaining quality assurance and checks, defining population structure, performing genomic epidemiology, and inferring lineage to allow reliable and actionable identification and classification. Additionally, the pandemic has required public health laboratories to reach high throughput proficiency in sequencing library preparation and downstream data analysis rapidly. However, both processes can be limited by a lack of a standardized sequence dataset.
METHODS: We identified six SARS-CoV-2 sequence datasets from recent publications, public databases and internal resources. In addition, we created a method to mine public databases to identify representative genomes for these datasets. Using this novel method, we identified several genomes as either VOI/VOC representatives or non-VOI/VOC representatives. To describe each dataset, we utilized a previously published datasets format, which describes accession information and whole dataset information. Additionally, a script from the same publication has been enhanced to download and verify all data from this study.
RESULTS: The benchmark datasets focus on the two most widely used sequencing platforms: long read sequencing data from the Oxford Nanopore Technologies platform and short read sequencing data from the Illumina platform. There are six datasets: three were derived from recent publications; two were derived from data mining public databases to answer common questions not covered by published datasets; one unique dataset representing common sequence failures was obtained by rigorously scrutinizing data that did not pass quality checks. The dataset summary table, data mining script and quality control (QC) values for all sequence data are publicly available on GitHub: https://github.com/CDCgov/datasets-sars-cov-2.
DISCUSSION: The datasets presented here were generated to help public health laboratories build sequencing and bioinformatics capacity, benchmark different workflows and pipelines, and calibrate QC thresholds to ensure sequencing quality. Together, improvements in these areas support accurate and timely outbreak investigation and surveillance, providing actionable data for pandemic management. Furthermore, these publicly available and standardized benchmark data will facilitate the development and adjudication of new pipelines.
format Online
Article
Text
id pubmed-9454940
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-9454940 2022-09-09 Benchmark datasets for SARS-CoV-2 surveillance bioinformatics PeerJ Bioinformatics PeerJ Inc. 2022-09-05 /pmc/articles/PMC9454940/ /pubmed/36093336 http://dx.doi.org/10.7717/peerj.13821 Text en ©2022 Xiaoli et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title Benchmark datasets for SARS-CoV-2 surveillance bioinformatics
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9454940/
https://www.ncbi.nlm.nih.gov/pubmed/36093336
http://dx.doi.org/10.7717/peerj.13821