A Schema for Digitized Surface Swab Site Metadata in Open-Source DNA Sequence Databases

Large, open-source DNA sequence databases have been generated, in part, through the collection of microbial pathogens by swabbing surfaces in built environments. Analyzing these data in aggregate through public health surveillance requires digitization of the complex, domain-specific metadata that a...

Bibliographic Details
Main authors: Feng, Jingzhang, Daeschel, Devin, Dooley, Damion, Griffiths, Emma, Allard, Marc, Timme, Ruth, Chen, Yi, Snyder, Abigail B.
Format: Online Article Text
Language: English
Published: American Society for Microbiology 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10134794/
https://www.ncbi.nlm.nih.gov/pubmed/36847566
http://dx.doi.org/10.1128/msystems.01284-22
author Feng, Jingzhang
Daeschel, Devin
Dooley, Damion
Griffiths, Emma
Allard, Marc
Timme, Ruth
Chen, Yi
Snyder, Abigail B.
author_sort Feng, Jingzhang
collection PubMed
description Large, open-source DNA sequence databases have been generated, in part, through the collection of microbial pathogens by swabbing surfaces in built environments. Analyzing these data in aggregate through public health surveillance requires digitization of the complex, domain-specific metadata that are associated with the swab site locations. However, swab site location information is currently collected in a single, free-text “isolation source” field, which promotes the generation of poorly detailed descriptions with varying word order, granularity, and linguistic errors, making automation difficult and reducing machine-actionability. We assessed 1,498 free-text swab site descriptions that were generated during routine foodborne pathogen surveillance. The lexicon of free-text metadata was evaluated to determine the informational facets and the quantity of unique terms used by data collectors. Open Biological Ontologies (OBO) Foundry libraries were used to develop hierarchical vocabularies that are connected with logical relationships to describe swab site locations. Five informational facets that were described by 338 unique terms were identified via content analysis. Term hierarchy facets were developed, as were statements (called axioms) about how the entities within these five domains are related. The schema developed through this study has been integrated into a publicly available pathogen metadata standard, facilitating ongoing surveillance and investigations. The One Health Enteric Package has been available at NCBI BioSample since 2022. The collective use of metadata standards increases the interoperability of DNA sequence databases and enables large-scale approaches to data sharing and artificial intelligence as well as big-data solutions to food safety.
IMPORTANCE The regular analysis of whole-genome sequence data in collections such as NCBI’s Pathogen Detection Database is used by many public health organizations to detect outbreaks of infectious disease. However, isolate metadata in these databases are often incomplete and of poor quality. These complex, raw metadata must often be reorganized and manually formatted for use in aggregate analyses. These processes are inefficient and time-consuming, increasing the interpretative labor needed by public health groups to extract actionable information. The future use of open genomic epidemiology networks will be supported through the development of an internationally applicable vocabulary system with which swab site locations can be described.
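The contrast the abstract draws between a single free-text “isolation source” field and a faceted, controlled vocabulary can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the facet names, the terms, and the helper functions `tokenize`, `annotate`, and `unique_terms` are hypothetical and are not the five facets, the 338 terms, or the axioms developed in the study, which this record does not list.

```python
import re
from collections import Counter

# HYPOTHETICAL vocabulary: token -> (facet, preferred term).
# These facet names and terms are illustrative assumptions only; they are
# not taken from the published schema.
VOCABULARY = {
    "floor": ("surface", "floor"),
    "drain": ("fixture", "drain"),
    "conveyor": ("equipment", "conveyor belt"),
    "belt": ("equipment", "conveyor belt"),
    "cooler": ("room", "cooler"),
}


def tokenize(description):
    """Lowercase a free-text description and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", description.lower())


def annotate(description):
    """Map a free-text swab site description onto facet -> sorted term list.

    Word order, case, and punctuation in the input do not affect the result;
    tokens missing from the vocabulary are ignored.
    """
    facets = {}
    for token in tokenize(description):
        if token in VOCABULARY:
            facet, term = VOCABULARY[token]
            facets.setdefault(facet, set()).add(term)
    return {facet: sorted(terms) for facet, terms in facets.items()}


def unique_terms(descriptions):
    """Count, for each token, how many descriptions contain it at least once."""
    counts = Counter()
    for description in descriptions:
        counts.update(set(tokenize(description)))
    return counts
```

With this sketch, `annotate("Floor drain, cooler")` and `annotate("COOLER - drain (floor)")` yield the same structured record, which is the kind of machine-actionability a free-text field cannot guarantee. A real implementation would draw its terms and relationships from the published ontology rather than from a hand-written dictionary.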
format Online
Article
Text
id pubmed-10134794
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Society for Microbiology
record_format MEDLINE/PubMed
spelling pubmed-10134794 2023-04-28 A Schema for Digitized Surface Swab Site Metadata in Open-Source DNA Sequence Databases Feng, Jingzhang Daeschel, Devin Dooley, Damion Griffiths, Emma Allard, Marc Timme, Ruth Chen, Yi Snyder, Abigail B. mSystems Research Article American Society for Microbiology 2023-02-27 /pmc/articles/PMC10134794/ /pubmed/36847566 http://dx.doi.org/10.1128/msystems.01284-22 Text en Copyright © 2023 Feng et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Research Article
Feng, Jingzhang
Daeschel, Devin
Dooley, Damion
Griffiths, Emma
Allard, Marc
Timme, Ruth
Chen, Yi
Snyder, Abigail B.
A Schema for Digitized Surface Swab Site Metadata in Open-Source DNA Sequence Databases
title A Schema for Digitized Surface Swab Site Metadata in Open-Source DNA Sequence Databases
title_sort schema for digitized surface swab site metadata in open-source dna sequence databases
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10134794/
https://www.ncbi.nlm.nih.gov/pubmed/36847566
http://dx.doi.org/10.1128/msystems.01284-22