Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest
As of June 2022, the GISAID database contains more than 11 million SARS-CoV-2 genomes, including several thousand nucleotide sequences for the most common variants such as delta or omicron. These SARS-CoV-2 strains have been collected from patients around the world since the beginning of the pandemi...
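Main authors: | Hahn, Georg; Lee, Sanghun; Prokopenko, Dmitry; Abraham, Jonathan; Novak, Tanya; Hecker, Julian; Cho, Michael; Khurana, Surender; Baden, Lindsey R.; Randolph, Adrienne G.; Weiss, Scott T.; Lange, Christoph |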
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9761049/ https://www.ncbi.nlm.nih.gov/pubmed/36536276 http://dx.doi.org/10.1186/s12859-022-05105-y |
_version_ | 1784852622237237248 |
---|---|
author | Hahn, Georg Lee, Sanghun Prokopenko, Dmitry Abraham, Jonathan Novak, Tanya Hecker, Julian Cho, Michael Khurana, Surender Baden, Lindsey R. Randolph, Adrienne G. Weiss, Scott T. Lange, Christoph |
author_facet | Hahn, Georg Lee, Sanghun Prokopenko, Dmitry Abraham, Jonathan Novak, Tanya Hecker, Julian Cho, Michael Khurana, Surender Baden, Lindsey R. Randolph, Adrienne G. Weiss, Scott T. Lange, Christoph |
author_sort | Hahn, Georg |
collection | PubMed |
description | As of June 2022, the GISAID database contains more than 11 million SARS-CoV-2 genomes, including several thousand nucleotide sequences for the most common variants such as delta or omicron. These SARS-CoV-2 strains have been collected from patients around the world since the beginning of the pandemic. We start by assessing the similarity of all pairs of nucleotide sequences using the Jaccard index and principal component analysis. As shown previously in the literature, an unsupervised cluster analysis applied to the SARS-CoV-2 genomes results in clusters of sequences according to certain characteristics such as their strain or their clade. Importantly, we observe that nucleotide sequences of common variants are often outliers in clusters of sequences stemming from variants identified earlier on during the pandemic. Motivated by this finding, we are interested in applying outlier detection to nucleotide sequences. We demonstrate that nucleotide sequences of common variants (such as alpha, delta, or omicron) can be identified solely based on a statistical outlier criterion. We argue that outlier detection might be a useful surveillance tool to identify emerging variants in real time as the pandemic progresses. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-05105-y. |
format | Online Article Text |
id | pubmed-9761049 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-9761049 2022-12-19 Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest Hahn, Georg Lee, Sanghun Prokopenko, Dmitry Abraham, Jonathan Novak, Tanya Hecker, Julian Cho, Michael Khurana, Surender Baden, Lindsey R. Randolph, Adrienne G. Weiss, Scott T. Lange, Christoph BMC Bioinformatics Research As of June 2022, the GISAID database contains more than 11 million SARS-CoV-2 genomes, including several thousand nucleotide sequences for the most common variants such as delta or omicron. These SARS-CoV-2 strains have been collected from patients around the world since the beginning of the pandemic. We start by assessing the similarity of all pairs of nucleotide sequences using the Jaccard index and principal component analysis. As shown previously in the literature, an unsupervised cluster analysis applied to the SARS-CoV-2 genomes results in clusters of sequences according to certain characteristics such as their strain or their clade. Importantly, we observe that nucleotide sequences of common variants are often outliers in clusters of sequences stemming from variants identified earlier on during the pandemic. Motivated by this finding, we are interested in applying outlier detection to nucleotide sequences. We demonstrate that nucleotide sequences of common variants (such as alpha, delta, or omicron) can be identified solely based on a statistical outlier criterion. We argue that outlier detection might be a useful surveillance tool to identify emerging variants in real time as the pandemic progresses. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-05105-y. 
BioMed Central 2022-12-19 /pmc/articles/PMC9761049/ /pubmed/36536276 http://dx.doi.org/10.1186/s12859-022-05105-y Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Hahn, Georg Lee, Sanghun Prokopenko, Dmitry Abraham, Jonathan Novak, Tanya Hecker, Julian Cho, Michael Khurana, Surender Baden, Lindsey R. Randolph, Adrienne G. Weiss, Scott T. Lange, Christoph Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest |
title | Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest |
title_full | Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest |
title_fullStr | Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest |
title_full_unstemmed | Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest |
title_short | Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest |
title_sort | unsupervised outlier detection applied to sars-cov-2 nucleotide sequences can identify sequences of common variants and other variants of interest |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9761049/ https://www.ncbi.nlm.nih.gov/pubmed/36536276 http://dx.doi.org/10.1186/s12859-022-05105-y |
work_keys_str_mv | AT hahngeorg unsupervisedoutlierdetectionappliedtosarscov2nucleotidesequencescanidentifysequencesofcommonvariantsandothervariantsofinterest AT leesanghun unsupervisedoutlierdetectionappliedtosarscov2nucleotidesequencescanidentifysequencesofcommonvariantsandothervariantsofinterest AT prokopenkodmitry unsupervisedoutlierdetectionappliedtosarscov2nucleotidesequencescanidentifysequencesofcommonvariantsandothervariantsofinterest AT abrahamjonathan unsupervisedoutlierdetectionappliedtosarscov2nucleotidesequencescanidentifysequencesofcommonvariantsandothervariantsofinterest AT novaktanya unsupervisedoutlierdetectionappliedtosarscov2nucleotidesequencescanidentifysequencesofcommonvariantsandothervariantsofinterest AT heckerjulian unsupervisedoutlierdetectionappliedtosarscov2nucleotidesequencescanidentifysequencesofcommonvariantsandothervariantsofinterest AT chomichael unsupervisedoutlierdetectionappliedtosarscov2nucleotidesequencescanidentifysequencesofcommonvariantsandothervariantsofinterest AT khuranasurender unsupervisedoutlierdetectionappliedtosarscov2nucleotidesequencescanidentifysequencesofcommonvariantsandothervariantsofinterest AT badenlindseyr unsupervisedoutlierdetectionappliedtosarscov2nucleotidesequencescanidentifysequencesofcommonvariantsandothervariantsofinterest AT randolphadrienneg unsupervisedoutlierdetectionappliedtosarscov2nucleotidesequencescanidentifysequencesofcommonvariantsandothervariantsofinterest AT weissscottt unsupervisedoutlierdetectionappliedtosarscov2nucleotidesequencescanidentifysequencesofcommonvariantsandothervariantsofinterest AT langechristoph unsupervisedoutlierdetectionappliedtosarscov2nucleotidesequencescanidentifysequencesofcommonvariantsandothervariantsofinterest |
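
The description field above outlines a pipeline of pairwise Jaccard similarity, principal component analysis, and a statistical outlier criterion, but this record does not give the details. The following is only a minimal sketch of how such a pipeline might look, not the authors' implementation. Several things here are assumptions: each sequence is encoded as a binary vector marking which positions differ from a reference genome, the outlier criterion is a robust z-score based on the median absolute deviation (MAD), and the function names `jaccard_matrix`, `principal_components`, `flag_outliers`, and `detect`, as well as the parameters `k` and `threshold`, are hypothetical.

```python
import numpy as np


def jaccard_matrix(X):
    """Pairwise Jaccard similarity between the rows of a binary matrix X.

    Assumption: row i marks the positions where sequence i differs
    from a reference genome (1 = differs, 0 = matches).
    """
    X = np.asarray(X, dtype=float)
    inter = X @ X.T                       # size of the intersection for each pair
    sizes = X.sum(axis=1)                 # number of marked positions per row
    union = sizes[:, None] + sizes[None, :] - inter
    # An empty union means both rows are all zero; treat them as identical.
    return np.divide(inter, union, out=np.ones_like(inter), where=union > 0)


def principal_components(J, k=2):
    """Scores of the rows of J on its first k principal components."""
    centered = J - J.mean(axis=0)
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :k] * s[:k]


def flag_outliers(scores, threshold=3.5):
    """Flag rows that lie far from the componentwise median.

    Assumed criterion: robust z-score of the distance to the median
    point, scaled by 1.4826 * MAD, compared against `threshold`.
    """
    scores = np.asarray(scores, dtype=float)
    dist = np.linalg.norm(scores - np.median(scores, axis=0), axis=1)
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    if mad == 0:
        # Degenerate case: at least half the distances are equal,
        # so flag every row that lies farther out than the median.
        return dist > med
    return (dist - med) / (1.4826 * mad) > threshold


def detect(X, k=2, threshold=3.5):
    """Chain the three steps: similarity, projection, outlier flagging."""
    return flag_outliers(principal_components(jaccard_matrix(X), k), threshold)
```

Calling `detect(X)` on a binary matrix with one row per sequence returns a boolean array with one entry per sequence. The cutoff of 3.5 is a common rule of thumb for MAD-based z-scores rather than a value taken from the article, and it would need tuning on real data.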