Understanding mutation hotspots for the SARS-CoV-2 spike protein using Shannon Entropy and K-means clustering

The SARS-CoV-2 virus, like many other viruses, has mutated continually, giving rise to new variants, commonly through substitutions and indels. In some cases these mutations give the virus a survival advantage, making the mutants dangerous. In general, laboratory...

Bibliographic Details
Main Authors:	Mullick, Baishali; Magar, Rishikesh; Jhunjhunwala, Aastha; Barati Farimani, Amir
Format:	Online Article Text
Language:	English
Published:	The Authors. Published by Elsevier Ltd. 2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8492016/ https://www.ncbi.nlm.nih.gov/pubmed/34655896 http://dx.doi.org/10.1016/j.compbiomed.2021.104915

author	Mullick, Baishali; Magar, Rishikesh; Jhunjhunwala, Aastha; Barati Farimani, Amir
author_sort	Mullick, Baishali
collection	PubMed
description	The SARS-CoV-2 virus, like many other viruses, has mutated continually, giving rise to new variants, commonly through substitutions and indels. In some cases these mutations give the virus a survival advantage, making the mutants dangerous. In general, laboratory investigation must be carried out to determine whether new variants have characteristics that make them more lethal and contagious. Complex and time-consuming analyses are therefore required to determine the exact impact of a particular mutation. The time these analyses take makes it difficult to understand variants of concern, which limits the preventive action that can be taken against their rapid spread. In this analysis, we deploy a statistical technique, Shannon entropy, to identify the positions in the SARS-CoV-2 spike protein sequence that are most susceptible to mutation. We then use machine-learning-based clustering techniques to group known dangerous mutations by similarity of their properties. This work uses embeddings generated by a language model, ProtBERT, to identify mutations of a similar nature and to pick out regions of interest based on proneness to change. Our entropy-based analysis predicted fifteen hotspot regions, among which we were able to validate ten known variants of interest in six hotspot regions. As the SARS-CoV-2 situation evolves rapidly, we believe that the remaining nine mutational hotspots may contain variants that could emerge in the future. We believe this approach may help the research community devise therapeutics based on probable new mutation zones in the viral sequence and on the resemblance in properties among mutations.
format	Online Article Text
id	pubmed-8492016
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	The Authors. Published by Elsevier Ltd.
record_format	MEDLINE/PubMed
spelling	pubmed-8492016 2021-10-06 Mullick, Baishali; Magar, Rishikesh; Jhunjhunwala, Aastha; Barati Farimani, Amir Comput Biol Med Article The Authors. Published by Elsevier Ltd. 2021-11 2021-10-05 /pmc/articles/PMC8492016/ /pubmed/34655896 http://dx.doi.org/10.1016/j.compbiomed.2021.104915 Text en © 2021 The Authors Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company's public news and information website. Elsevier hereby grants permission to make all its COVID-19-related research that is available on the COVID-19 resource centre - including this research content - immediately available in PubMed Central and other publicly funded repositories, such as the WHO COVID database with rights for unrestricted research re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for free by Elsevier for as long as the COVID-19 resource centre remains active.
title	Understanding mutation hotspots for the SARS-CoV-2 spike protein using Shannon Entropy and K-means clustering
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8492016/ https://www.ncbi.nlm.nih.gov/pubmed/34655896 http://dx.doi.org/10.1016/j.compbiomed.2021.104915
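
The description in this record names Shannon entropy as the statistic used to find mutation-prone positions in the spike protein. As a rough illustration only, and not the authors' implementation, the sketch below computes per-position entropy over a set of aligned sequences. The function names (`column_entropy`, `positional_entropy`) and the toy alignment are hypothetical and exist only for this example; a real analysis would read aligned spike sequences from a file.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the residues observed at one position."""
    counts = Counter(column)
    total = len(column)
    # H = sum over residues of p * log2(1 / p), with p = count / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def positional_entropy(aligned_seqs):
    """Entropy at every position of a set of equal-length aligned sequences."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*seqs) walks the alignment column by column
    return [column_entropy(col) for col in zip(*aligned_seqs)]


# Made-up toy alignment (assumption for illustration, not real spike data)
toy_alignment = ["MFVFL", "MFVFL", "MFAFL", "MFTFL"]
entropies = positional_entropy(toy_alignment)

# Positions ranked from most to least variable (0-based indices)
ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
print(entropies)
print(ranked[0])
```

A position where every sequence carries the same residue has entropy 0, and the value grows as residues become more evenly mixed, so high-entropy positions are candidates for the kind of hotspot the description refers to. How contiguous positions are merged into regions and which threshold is applied are not stated in this record.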
