Semi-Supervised Pipeline for Autonomous Annotation of SARS-CoV-2 Genomes

SARS-CoV-2 genomic sequencing efforts have scaled dramatically to address the current global pandemic and aid public health. However, autonomous genome annotation of SARS-CoV-2 genes, proteins, and domains is not readily accomplished by existing methods and results in missing or incorrect sequences....

Bibliographic Details
Main Authors:	Beck, Kristen L.; Seabolt, Edward; Agarwal, Akshay; Nayar, Gowri; Bianco, Simone; Krishnareddy, Harsha; Ngo, Timothy A.; Kunitomi, Mark; Mukherjee, Vandana; Kaufman, James H.
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8706859/ https://www.ncbi.nlm.nih.gov/pubmed/34960694 http://dx.doi.org/10.3390/v13122426

author	Beck, Kristen L.; Seabolt, Edward; Agarwal, Akshay; Nayar, Gowri; Bianco, Simone; Krishnareddy, Harsha; Ngo, Timothy A.; Kunitomi, Mark; Mukherjee, Vandana; Kaufman, James H.
author_sort	Beck, Kristen L.
collection	PubMed
description	SARS-CoV-2 genomic sequencing efforts have scaled dramatically to address the current global pandemic and aid public health. However, autonomous genome annotation of SARS-CoV-2 genes, proteins, and domains is not readily accomplished by existing methods and results in missing or incorrect sequences. To overcome this limitation, we developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that differentiates itself by not relying on the use of a single reference genome and by overcoming atypical genomic traits that challenge traditional bioinformatic methods. We analyzed an initial corpus of 66,000 SARS-CoV-2 genome sequences collected from labs across the world using our method and identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction, compared to proteome references, including Replicase polyprotein 1ab (with its transcriptional slippage site). Compared to other published tools, such as Prokka (base) and VAPiD, we yielded a 6.4- and 1.8-fold increase in protein annotations. Our method generated 13,000,000 gene, protein, and domain sequences—some conserved across time and geography and others representing emerging variants. We observed 3362 non-redundant sequences per protein on average within this corpus and described key D614G and N501Y variants spatiotemporally in the initial genome corpus. For spike glycoprotein domains, we achieved greater than 97.9% sequence identity to references and characterized receptor binding domain variants. We further demonstrated the robustness and extensibility of our method on an additional 4000 variant diverse genomes containing all named variants of concern and interest as of August 2021. 
In this cohort, we successfully identified all keystone spike glycoprotein mutations in our predicted protein sequences with greater than 99% accuracy as well as demonstrating high accuracy of the protein and domain annotations. This work comprehensively presents the molecular targets to refine biomedical interventions for SARS-CoV-2 with a scalable, high-accuracy method to analyze newly sequenced infections as they arise.
format	Online Article Text
id	pubmed-8706859
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-8706859 2021-12-25 Semi-Supervised Pipeline for Autonomous Annotation of SARS-CoV-2 Genomes. Viruses. Article. MDPI 2021-12-03 /pmc/articles/PMC8706859/ /pubmed/34960694 http://dx.doi.org/10.3390/v13122426 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	Semi-Supervised Pipeline for Autonomous Annotation of SARS-CoV-2 Genomes
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8706859/ https://www.ncbi.nlm.nih.gov/pubmed/34960694 http://dx.doi.org/10.3390/v13122426

Similar Items