A search-based geographic metadata curation pipeline to refine sequencing institution information and support public health

BACKGROUND: The National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) has amassed a vast reservoir of genetic data since its inception in 2007. These public data hold immense potential for supporting pathogen surveillance and control. However, the lack of standardized meta...

Bibliographic Details
Main authors: Zhao, Kun, Farrell, Katie, Mashiku, Melchizedek, Abay, Dawit, Tang, Kevin, Oberste, M. Steven, Burns, Cara C.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects: Public Health
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10683794/
https://www.ncbi.nlm.nih.gov/pubmed/38035280
http://dx.doi.org/10.3389/fpubh.2023.1254976
author Zhao, Kun
Farrell, Katie
Mashiku, Melchizedek
Abay, Dawit
Tang, Kevin
Oberste, M. Steven
Burns, Cara C.
collection PubMed
description BACKGROUND: The National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) has amassed a vast reservoir of genetic data since its inception in 2007. These public data hold immense potential for supporting pathogen surveillance and control. However, the lack of standardized metadata and inconsistent submission practices in SRA may impede the data’s utility in public health. METHODS: To address this issue, we introduce the Search-based Geographic Metadata Curation (SGMC) pipeline. SGMC utilized Python and web scraping to extract geographic data of sequencing institutions from NCBI SRA in the Cloud and its website. It then harnessed ChatGPT to refine the sequencing institution and location assignments. To illustrate the pipeline’s utility, we examined the geographic distribution of the sequencing institutions and their countries relevant to polio eradication and categorized them. RESULTS: SGMC successfully identified 7,649 sequencing institutions and their global locations from a random selection of 2,321,044 SRA accessions. These institutions were distributed across 97 countries, with strong representation in the United States, the United Kingdom and China. However, there was a lack of data from African, Central Asian, and Central American countries, indicating potential disparities in sequencing capabilities. Comparison with manually curated data for U.S. institutions reveals SGMC’s accuracy rates of 94.8% for institutions, 93.1% for countries, and 74.5% for geographic coordinates. CONCLUSION: SGMC may represent a novel approach using a generative AI model to enhance geographic data (country and institution assignments) for large numbers of samples within SRA datasets. This information can be utilized to bolster public health endeavors.
format Online
Article
Text
id pubmed-10683794
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-106837942023-11-30 A search-based geographic metadata curation pipeline to refine sequencing institution information and support public health Zhao, Kun Farrell, Katie Mashiku, Melchizedek Abay, Dawit Tang, Kevin Oberste, M. Steven Burns, Cara C. Front Public Health Public Health BACKGROUND: The National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) has amassed a vast reservoir of genetic data since its inception in 2007. These public data hold immense potential for supporting pathogen surveillance and control. However, the lack of standardized metadata and inconsistent submission practices in SRA may impede the data’s utility in public health. METHODS: To address this issue, we introduce the Search-based Geographic Metadata Curation (SGMC) pipeline. SGMC utilized Python and web scraping to extract geographic data of sequencing institutions from NCBI SRA in the Cloud and its website. It then harnessed ChatGPT to refine the sequencing institution and location assignments. To illustrate the pipeline’s utility, we examined the geographic distribution of the sequencing institutions and their countries relevant to polio eradication and categorized them. RESULTS: SGMC successfully identified 7,649 sequencing institutions and their global locations from a random selection of 2,321,044 SRA accessions. These institutions were distributed across 97 countries, with strong representation in the United States, the United Kingdom and China. However, there was a lack of data from African, Central Asian, and Central American countries, indicating potential disparities in sequencing capabilities. Comparison with manually curated data for U.S. institutions reveals SGMC’s accuracy rates of 94.8% for institutions, 93.1% for countries, and 74.5% for geographic coordinates. 
CONCLUSION: SGMC may represent a novel approach using a generative AI model to enhance geographic data (country and institution assignments) for large numbers of samples within SRA datasets. This information can be utilized to bolster public health endeavors. Frontiers Media S.A. 2023-11-14 /pmc/articles/PMC10683794/ /pubmed/38035280 http://dx.doi.org/10.3389/fpubh.2023.1254976 Text en Copyright © 2023 Zhao, Farrell, Mashiku, Abay, Tang, Oberste and Burns. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title A search-based geographic metadata curation pipeline to refine sequencing institution information and support public health
topic Public Health
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10683794/
https://www.ncbi.nlm.nih.gov/pubmed/38035280
http://dx.doi.org/10.3389/fpubh.2023.1254976