Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest

As of June 2022, the GISAID database contains more than 11 million SARS-CoV-2 genomes, including several thousand nucleotide sequences for the most common variants such as delta or omicron. These SARS-CoV-2 strains have been collected from patients around the world since the beginning of the pandemi...

Bibliographic Details
Main authors: Hahn, Georg; Lee, Sanghun; Prokopenko, Dmitry; Abraham, Jonathan; Novak, Tanya; Hecker, Julian; Cho, Michael; Khurana, Surender; Baden, Lindsey R.; Randolph, Adrienne G.; Weiss, Scott T.; Lange, Christoph
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9761049/
https://www.ncbi.nlm.nih.gov/pubmed/36536276
http://dx.doi.org/10.1186/s12859-022-05105-y
author Hahn, Georg
Lee, Sanghun
Prokopenko, Dmitry
Abraham, Jonathan
Novak, Tanya
Hecker, Julian
Cho, Michael
Khurana, Surender
Baden, Lindsey R.
Randolph, Adrienne G.
Weiss, Scott T.
Lange, Christoph
collection PubMed
description As of June 2022, the GISAID database contains more than 11 million SARS-CoV-2 genomes, including several thousand nucleotide sequences for the most common variants such as delta or omicron. These SARS-CoV-2 strains have been collected from patients around the world since the beginning of the pandemic. We start by assessing the similarity of all pairs of nucleotide sequences using the Jaccard index and principal component analysis. As shown previously in the literature, an unsupervised cluster analysis applied to the SARS-CoV-2 genomes results in clusters of sequences according to certain characteristics such as their strain or their clade. Importantly, we observe that nucleotide sequences of common variants are often outliers in clusters of sequences stemming from variants identified earlier on during the pandemic. Motivated by this finding, we are interested in applying outlier detection to nucleotide sequences. We demonstrate that nucleotide sequences of common variants (such as alpha, delta, or omicron) can be identified solely based on a statistical outlier criterion. We argue that outlier detection might be a useful surveillance tool to identify emerging variants in real time as the pandemic progresses. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-05105-y.
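The similarity-then-outlier idea in the description can be illustrated with a minimal sketch. It is not the authors' pipeline: the names `mismatch_set`, `jaccard`, and `outlier_flags`, the pre-aligned toy sequences, and the median/MAD cutoff of 3.5 are all assumptions made for illustration, and the principal component step mentioned in the description is omitted.

```python
from statistics import median


def mismatch_set(seq, reference):
    """Positions where an aligned sequence differs from the reference."""
    return {i for i, (a, b) in enumerate(zip(seq, reference)) if a != b}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def outlier_flags(sequences, reference, cutoff=3.5):
    """Flag sequences whose mean Jaccard distance to the rest is unusually high.

    The median/MAD rule and the cutoff are illustrative assumptions,
    not the criterion used in the paper.
    """
    sets = [mismatch_set(s, reference) for s in sequences]
    n = len(sets)
    if n < 2:
        return [False] * n
    # Score each sequence by its mean Jaccard distance to all the others.
    scores = [
        sum(1.0 - jaccard(sets[i], sets[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    if mad == 0:
        # Degenerate case: no spread, so flag anything above the median.
        return [s > med for s in scores]
    # Modified z-score based on the median absolute deviation.
    return [0.6745 * (s - med) / mad > cutoff for s in scores]


# Toy data (hypothetical): five near-identical sequences and one divergent one.
reference = "AAAAAAAAAA"
sequences = [
    "CAAAAAAAAA",
    "CAAAAAAAAA",
    "CCAAAAAAAA",
    "CAAAAAAAAA",
    "CCAAAAAAAA",
    "AAAAAGGGGG",
]
flags = outlier_flags(sequences, reference)
```

In this toy input only the last sequence shares no mutated positions with the others, so it is the one the rule flags. Real genomes would need alignment against a reference first, and the cutoff would have to be tuned on actual data.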
format Online
Article
Text
id pubmed-9761049
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9761049 2022-12-19 BMC Bioinformatics Research
BioMed Central 2022-12-19 /pmc/articles/PMC9761049/ /pubmed/36536276 http://dx.doi.org/10.1186/s12859-022-05105-y Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest
topic Research