Analysis of 329,942 SARS-CoV-2 records retrieved from GISAID database

BACKGROUND: The SARS-CoV-2 virus caused a worldwide pandemic – although none of its predecessors from the coronavirus family ever achieved such a scale. The key to understanding the global success of SARS-CoV-2 is hidden in its genome. MATERIALS AND METHODS: We retrieved data for 329,942 SARS-CoV-2...

Bibliographic Details
Main Authors: Zelenova, Maria; Ivanova, Anna; Semyonov, Semyon; Gankin, Yuriy
Format: Online Article Text
Language: English
Published: Elsevier Ltd. 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8547852/
https://www.ncbi.nlm.nih.gov/pubmed/34735950
http://dx.doi.org/10.1016/j.compbiomed.2021.104981
_version_ 1784590460504768512
author Zelenova, Maria
Ivanova, Anna
Semyonov, Semyon
Gankin, Yuriy
collection PubMed
description BACKGROUND: The SARS-CoV-2 virus caused a worldwide pandemic, although none of its predecessors from the coronavirus family ever achieved such a scale. The key to understanding the global success of SARS-CoV-2 is hidden in its genome. MATERIALS AND METHODS: We retrieved data for 329,942 SARS-CoV-2 records uploaded to the GISAID database from the beginning of the pandemic until January 8, 2021. A Python variant detection script was developed to process the data using pairwise2 from the BioPython library. Sequence alignments were performed for every gene separately (except ORF1ab, which was not studied). Genomes shorter than 26,000 nucleotides were excluded from the research. Clustering was performed using HDBScan. RESULTS: Here, we addressed the genetic variability of SARS-CoV-2 using 329,942 samples. The analysis yielded 155 SNPs and deletions in more than 0.3% of the sequences. Clustering results suggested that a proportion of people (2.46%) was infected with a distinct subtype of the B.1.1.7 variant, which contained four to six additional mutations (G28881A, G28882A, G28883C, A23403G, A28095T, G25437T). Two clusters were formed by mutations in the samples uploaded predominantly by Denmark and Australia (1.48% and 2.51%, respectively). A correlation coefficient matrix detected 160 pairs of mutations (correlation coefficient greater than 0.7). We also addressed the completeness of the GISAID database, patient gender, and age. Finally, we found ORF6 and E to be the most conserved genes (96.15% and 94.66% of the sequences fully match the reference, respectively). Our results indicate multiple areas for further research in both SARS-CoV-2 studies and health science.
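The filtering and correlation steps named in the description can be illustrated with a minimal sketch. This is not the authors' script: it assumes the samples are already aligned to the reference (the paper used pairwise2 from BioPython for that step), it reports substitutions only and ignores deletions, and it assumes a Pearson coefficient because the abstract does not say which coefficient was used. All function names are hypothetical; only the three thresholds come from the abstract.

```python
from itertools import combinations
from math import sqrt

MIN_GENOME_LENGTH = 26_000  # abstract: shorter genomes were excluded
MIN_FREQUENCY = 0.003       # abstract: variants in more than 0.3% of sequences
MIN_CORRELATION = 0.7       # abstract: correlation coefficient greater than 0.7


def keep_long_genomes(genomes, min_length=MIN_GENOME_LENGTH):
    """Drop genomes shorter than min_length nucleotides."""
    return {name: seq for name, seq in genomes.items() if len(seq) >= min_length}


def call_substitutions(reference, aligned_sample):
    """Return substitutions such as 'T4A' (1-based position).

    Assumption: both strings are already aligned and of equal length.
    Positions holding a gap ('-') or an ambiguous base ('N') are skipped.
    """
    calls = set()
    for pos, (ref, alt) in enumerate(zip(reference, aligned_sample), start=1):
        if ref != alt and ref not in "-N" and alt not in "-N":
            calls.add(f"{ref}{pos}{alt}")
    return calls


def frequent_variants(calls_per_sample, min_frequency=MIN_FREQUENCY):
    """Keep variants present in more than min_frequency of the samples."""
    total = len(calls_per_sample)
    counts = {}
    for calls in calls_per_sample:
        for variant in calls:
            counts[variant] = counts.get(variant, 0) + 1
    return {v for v, n in counts.items() if n / total > min_frequency}


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(calls_per_sample, variants, threshold=MIN_CORRELATION):
    """List variant pairs whose presence/absence vectors correlate above threshold."""
    ordered = sorted(variants)
    presence = {
        v: [1 if v in calls else 0 for calls in calls_per_sample] for v in ordered
    }
    return [
        (a, b)
        for a, b in combinations(ordered, 2)
        if pearson(presence[a], presence[b]) > threshold
    ]
```

In use, each sample that survives keep_long_genomes would be passed through call_substitutions once per gene, the per-sample call sets would be collected in a list, and frequent_variants followed by correlated_pairs would be run on that list. The clustering step (HDBScan in the paper) is not covered by this sketch.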
format Online
Article
Text
id pubmed-8547852
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Elsevier Ltd.
record_format MEDLINE/PubMed
spelling pubmed-8547852 2021-10-27 Analysis of 329,942 SARS-CoV-2 records retrieved from GISAID database Zelenova, Maria Ivanova, Anna Semyonov, Semyon Gankin, Yuriy Comput Biol Med Article Elsevier Ltd.
2021-12 2021-10-26 /pmc/articles/PMC8547852/ /pubmed/34735950 http://dx.doi.org/10.1016/j.compbiomed.2021.104981 Text en © 2021 Elsevier Ltd. All rights reserved. Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company's public news and information website. Elsevier hereby grants permission to make all its COVID-19-related research that is available on the COVID-19 resource centre - including this research content - immediately available in PubMed Central and other publicly funded repositories, such as the WHO COVID database with rights for unrestricted research re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for free by Elsevier for as long as the COVID-19 resource centre remains active.
title Analysis of 329,942 SARS-CoV-2 records retrieved from GISAID database
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8547852/
https://www.ncbi.nlm.nih.gov/pubmed/34735950
http://dx.doi.org/10.1016/j.compbiomed.2021.104981