Analysis of 329,942 SARS-CoV-2 records retrieved from GISAID database
BACKGROUND: The SARS-CoV-2 virus caused a worldwide pandemic – although none of its predecessors from the coronavirus family ever achieved such a scale. The key to understanding the global success of SARS-CoV-2 is hidden in its genome. MATERIALS AND METHODS: We retrieved data for 329,942 SARS-CoV-2...
Main Authors: | Zelenova, Maria; Ivanova, Anna; Semyonov, Semyon; Gankin, Yuriy |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier Ltd. 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8547852/ https://www.ncbi.nlm.nih.gov/pubmed/34735950 http://dx.doi.org/10.1016/j.compbiomed.2021.104981 |
_version_ | 1784590460504768512 |
---|---|
author | Zelenova, Maria Ivanova, Anna Semyonov, Semyon Gankin, Yuriy |
author_facet | Zelenova, Maria Ivanova, Anna Semyonov, Semyon Gankin, Yuriy |
author_sort | Zelenova, Maria |
collection | PubMed |
description | BACKGROUND: The SARS-CoV-2 virus caused a worldwide pandemic, although none of its predecessors from the coronavirus family ever achieved such a scale. The key to understanding the global success of SARS-CoV-2 is hidden in its genome. MATERIALS AND METHODS: We retrieved data for 329,942 SARS-CoV-2 records uploaded to the GISAID database from the beginning of the pandemic until January 8, 2021. A Python variant detection script was developed to process the data using pairwise2 from the Biopython library. Sequence alignments were performed for every gene separately (except ORF1ab, which was not studied). Genomes shorter than 26,000 nucleotides were excluded from the study. Clustering was performed using HDBSCAN. RESULTS: We addressed the genetic variability of SARS-CoV-2 using 329,942 samples. The analysis yielded 155 SNPs and deletions present in more than 0.3% of the sequences. Clustering results suggested that a proportion of people (2.46%) were infected with a distinct subtype of the B.1.1.7 variant, which contained four to six additional mutations (G28881A, G28882A, G28883C, A23403G, A28095T, G25437T). Two clusters were formed by mutations in the samples uploaded predominantly by Denmark and Australia (1.48% and 2.51%, respectively). A correlation coefficient matrix detected 160 pairs of mutations with a correlation coefficient greater than 0.7. We also addressed the completeness of the GISAID database, patient gender, and age. Finally, we found ORF6 and E to be the most conserved genes (96.15% and 94.66% of the sequences match the reference exactly, respectively). Our results indicate multiple areas for further research in both SARS-CoV-2 studies and health science. |
format | Online Article Text |
id | pubmed-8547852 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Elsevier Ltd. |
record_format | MEDLINE/PubMed |
spelling | pubmed-8547852 2021-10-27 Analysis of 329,942 SARS-CoV-2 records retrieved from GISAID database Zelenova, Maria Ivanova, Anna Semyonov, Semyon Gankin, Yuriy Comput Biol Med Article BACKGROUND: The SARS-CoV-2 virus caused a worldwide pandemic, although none of its predecessors from the coronavirus family ever achieved such a scale. The key to understanding the global success of SARS-CoV-2 is hidden in its genome. MATERIALS AND METHODS: We retrieved data for 329,942 SARS-CoV-2 records uploaded to the GISAID database from the beginning of the pandemic until January 8, 2021. A Python variant detection script was developed to process the data using pairwise2 from the Biopython library. Sequence alignments were performed for every gene separately (except ORF1ab, which was not studied). Genomes shorter than 26,000 nucleotides were excluded from the study. Clustering was performed using HDBSCAN. RESULTS: We addressed the genetic variability of SARS-CoV-2 using 329,942 samples. The analysis yielded 155 SNPs and deletions present in more than 0.3% of the sequences. Clustering results suggested that a proportion of people (2.46%) were infected with a distinct subtype of the B.1.1.7 variant, which contained four to six additional mutations (G28881A, G28882A, G28883C, A23403G, A28095T, G25437T). Two clusters were formed by mutations in the samples uploaded predominantly by Denmark and Australia (1.48% and 2.51%, respectively). A correlation coefficient matrix detected 160 pairs of mutations with a correlation coefficient greater than 0.7. We also addressed the completeness of the GISAID database, patient gender, and age. Finally, we found ORF6 and E to be the most conserved genes (96.15% and 94.66% of the sequences match the reference exactly, respectively). Our results indicate multiple areas for further research in both SARS-CoV-2 studies and health science. Elsevier Ltd. 
2021-12 2021-10-26 /pmc/articles/PMC8547852/ /pubmed/34735950 http://dx.doi.org/10.1016/j.compbiomed.2021.104981 Text en © 2021 Elsevier Ltd. All rights reserved. Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company's public news and information website. Elsevier hereby grants permission to make all its COVID-19-related research that is available on the COVID-19 resource centre - including this research content - immediately available in PubMed Central and other publicly funded repositories, such as the WHO COVID database with rights for unrestricted research re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for free by Elsevier for as long as the COVID-19 resource centre remains active. |
spellingShingle | Article Zelenova, Maria Ivanova, Anna Semyonov, Semyon Gankin, Yuriy Analysis of 329,942 SARS-CoV-2 records retrieved from GISAID database |
title | Analysis of 329,942 SARS-CoV-2 records retrieved from GISAID database |
title_sort | analysis of 329,942 sars-cov-2 records retrieved from gisaid database |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8547852/ https://www.ncbi.nlm.nih.gov/pubmed/34735950 http://dx.doi.org/10.1016/j.compbiomed.2021.104981 |
work_keys_str_mv | AT zelenovamaria analysisof329942sarscov2recordsretrievedfromgisaiddatabase AT ivanovaanna analysisof329942sarscov2recordsretrievedfromgisaiddatabase AT semyonovsemyon analysisof329942sarscov2recordsretrievedfromgisaiddatabase AT gankinyuriy analysisof329942sarscov2recordsretrievedfromgisaiddatabase |
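
The description field above mentions two steps that are easy to illustrate: excluding genomes shorter than 26,000 nucleotides, and flagging mutation pairs whose correlation coefficient exceeds 0.7. The following is a minimal sketch of those two steps only, not the authors' script. It assumes a Pearson coefficient computed on 0/1 presence indicators (the record does not say which coefficient was used), and the record layout, the function names, and the toy data are all hypothetical.

```python
from itertools import combinations
from math import sqrt

MIN_GENOME_LENGTH = 26_000    # exclusion threshold stated in the abstract
CORRELATION_THRESHOLD = 0.7   # pair threshold stated in the abstract


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (assumed metric)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def correlated_pairs(records, threshold=CORRELATION_THRESHOLD):
    """Return (mutation_a, mutation_b, r) for pairs with r above the threshold."""
    # Step 1: drop genomes shorter than the minimum length.
    kept = [r for r in records if r["length"] >= MIN_GENOME_LENGTH]
    # Step 2: build one 0/1 presence column per mutation seen in kept genomes.
    mutations = sorted({m for r in kept for m in r["mutations"]})
    columns = {
        m: [1 if m in r["mutations"] else 0 for r in kept] for m in mutations
    }
    # Step 3: compare every pair of columns.
    pairs = []
    for a, b in combinations(mutations, 2):
        r = pearson(columns[a], columns[b])
        if r > threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical toy records; real input would come from per-gene alignments.
records = [
    {"length": 29903, "mutations": {"G28881A", "G28882A", "A23403G"}},
    {"length": 29850, "mutations": {"G28881A", "G28882A"}},
    {"length": 29900, "mutations": {"A23403G"}},
    {"length": 29880, "mutations": set()},
    {"length": 25000, "mutations": {"G28881A", "A28095T"}},  # too short, excluded
]

print(correlated_pairs(records))
```

In this toy input, G28881A and G28882A always occur together in the retained genomes, so they form the only reported pair; A28095T never enters the comparison because its only carrier is filtered out by length.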