Genome-wide analysis of 10664 SARS-CoV-2 genomes to identify virus strains in 73 countries based on single nucleotide polymorphism

Since the onslaught of SARS-CoV-2, the research community has been searching for a vaccine to fight against this virus. However, during this period, the virus has mutated to adapt to the different environmental conditions in the world and made the task of vaccine design more challenging. In this sit...

Bibliographic Details
Main Authors: Ghosh, Nimisha, Saha, Indrajit, Sharma, Nikhil, Nandi, Suman, Plewczynski, Dariusz
Format: Online Article Text
Language: English
Published: Elsevier B.V. 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7997709/
https://www.ncbi.nlm.nih.gov/pubmed/33781798
http://dx.doi.org/10.1016/j.virusres.2021.198401
_version_ 1783670389753249792
author Ghosh, Nimisha
Saha, Indrajit
Sharma, Nikhil
Nandi, Suman
Plewczynski, Dariusz
author_sort Ghosh, Nimisha
collection PubMed
description Since the onslaught of SARS-CoV-2, the research community has been searching for a vaccine against the virus. During this period, however, the virus has mutated as it adapted to different environmental conditions around the world, making vaccine design more challenging. In this situation, identifying virus strains is a timely and important task. We performed a genome-wide analysis of 10664 SARS-CoV-2 genomes from 73 countries to identify and prepare a Single Nucleotide Polymorphism (SNP) dataset of SARS-CoV-2. Using this SNP data, we applied hierarchical clustering with Average Linkage and Complete Linkage, each with Jaccard and Hamming distance functions applied separately, to identify the virus strains as clusters present in the SNP data. The consensus of both clustering results is also considered, and the Silhouette index is used as a cluster validity index both to measure the goodness of the clusters and to determine the number of clusters or virus strains. As a result, we identified five major clusters or virus strains present worldwide. Apart from quantitative measures, these clusters are also visualised using a Visual Assessment of Tendency (VAT) plot. The evolution of these clusters is also shown. Furthermore, the top 10 signature SNPs are identified in each cluster, and the non-synonymous signature SNPs are visualised in the respective protein structures. Sequence and structural homology-based predictions, along with the protein structural stability of these non-synonymous signature SNPs, are also reported in order to characterise the identified clusters. Consequently, T85I, Q57H and R203M in NSP2, ORF3a and Nucleocapsid, respectively, are found to be responsible for Cluster 1, as they are damaging and unstable non-synonymous signature SNPs. Similarly, F506L and S507C in Exon are responsible for both Clusters 3 and 4, while Clusters 2 and 5 do not exhibit such behaviour because they have no non-synonymous signature SNPs. In addition, the code, the SNP dataset, the 10664 labelled SARS-CoV-2 strains and additional results are provided as supplementary material through our website for further use.
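The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code (that is available through their website); it only shows, on a hypothetical toy binary SNP matrix, how average and complete linkage can be combined with Jaccard and Hamming distances and how a silhouette score can be used to choose the number of clusters. The names `toy_snps`, `silhouette` and `best_clustering`, the 12-position matrix and the range of candidate cluster counts are all assumptions made for illustration. The consensus step and the VAT plot are not covered.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def toy_snps():
    """Hypothetical data: 3 groups of 4 genomes over 12 SNP positions (True = SNP present)."""
    rows = []
    for g in range(3):
        proto = np.zeros(12, dtype=bool)
        proto[4 * g: 4 * g + 4] = True      # each group carries its own block of 4 SNPs
        rows.append(proto)
        for j in range(1, 4):
            variant = proto.copy()
            variant[4 * g + j] = False      # a variant lacks one of the group's SNPs
            rows.append(variant)
    return np.array(rows)


def silhouette(dist, labels):
    """Mean silhouette width from a square distance matrix and cluster labels."""
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue                        # singleton cluster: score stays 0
        a = dist[i, same].mean()            # mean distance within own cluster
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()


def best_clustering(snps, metric, method, k_range=range(2, 7)):
    """Cut the dendrogram at each candidate k and keep the cut with the highest silhouette."""
    condensed = pdist(snps, metric=metric)
    dist = squareform(condensed)
    tree = linkage(condensed, method=method)
    best = None
    for k in k_range:
        labels = fcluster(tree, t=k, criterion="maxclust")
        if len(np.unique(labels)) < 2:
            continue                        # silhouette needs at least two clusters
        score = silhouette(dist, labels)
        if best is None or score > best[1]:
            best = (len(np.unique(labels)), score, labels)
    return best


snps = toy_snps()
for metric in ("jaccard", "hamming"):
    for method in ("average", "complete"):
        k, score, labels = best_clustering(snps, metric, method)
        print(f"{metric:8s} {method:9s} clusters={k} silhouette={score:.3f}")
```

On real data, the rows would be genomes and the columns the SNP positions found against a reference sequence; the toy matrix here only keeps the example self-contained.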
format Online
Article
Text
id pubmed-7997709
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Elsevier B.V.
record_format MEDLINE/PubMed
spelling pubmed-7997709 2021-03-29 Virus Res Article Elsevier B.V. 2021-06 2021-03-26 /pmc/articles/PMC7997709/ /pubmed/33781798 http://dx.doi.org/10.1016/j.virusres.2021.198401 Text en © 2021 Elsevier B.V. All rights reserved. Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company's public news and information website. Elsevier hereby grants permission to make all its COVID-19-related research that is available on the COVID-19 resource centre, including this research content, immediately available in PubMed Central and other publicly funded repositories, such as the WHO COVID database, with rights for unrestricted research re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for free by Elsevier for as long as the COVID-19 resource centre remains active.
title Genome-wide analysis of 10664 SARS-CoV-2 genomes to identify virus strains in 73 countries based on single nucleotide polymorphism
topic Article