
NGS data vectorization, clustering, and finding key codons in SARS-CoV-2 variations

The rapid global spread and dissemination of SARS-CoV-2 has provided the virus with numerous opportunities to develop several variants. Thus, it is critical to determine the degree of the variations and in which part of the virus those variations occurred. Therefore, in this study, methods that coul...


Bibliographic Details
Main Authors: Kim, Juhyeon; Cheon, Saeyeon; Ahn, Insung
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9113074/
https://www.ncbi.nlm.nih.gov/pubmed/35581558
http://dx.doi.org/10.1186/s12859-022-04718-7
author Kim, Juhyeon
Cheon, Saeyeon
Ahn, Insung
collection PubMed
description The rapid global spread of SARS-CoV-2 has given the virus numerous opportunities to develop variants. It is therefore critical to determine the degree of variation and the part of the virus in which the variations occurred. This study proposes machine learning methods to vectorize sequence data, perform clustering analysis, and visualize the results. A total of 224,073 SARS-CoV-2 sequences were collected from NCBI and GISAID, and the data were visualized using dimensionality reduction and clustering models such as t-SNE and DBSCAN. In the cluster results, the originally detected SARS-CoV-2 virus was distinguished from variants including Omicron and Delta. Furthermore, feature importance extraction methods such as Random Forest or Shapley values made it possible to examine which codon changes in the spike protein caused the variants to be distinguished. Compared with existing tree-based sequence analysis, the proposed method has the advantage of analysing and visualizing a large amount of data at once. It identified and visualized significant changes between the SARS-CoV-2 virus first detected in Wuhan, China, in December 2019 and the newly formed mutant virus groups. Clustering analysis of the sequence data confirmed the formation of clusters among variants in a two-dimensional graph, and variable importance extraction confirmed which codon changes played a major role in distinguishing variants. Because the proposed method can handle a variety of sequence data, it can be applied to a variety of diseases, including influenza and SARS-CoV-2. The proposed method therefore has the potential to become widely used for the effective analysis of disease variations.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04718-7.
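The vectorization and codon-comparison steps summarized in the description can be illustrated with a minimal sketch. It assumes a simple integer encoding of the 64 codons; this record does not state which encoding the authors used, so the mapping, the `UNKNOWN` code, the function names, and the toy sequences below are all hypothetical.

```python
from itertools import product

# Assumption: each codon maps to an integer 0..63 (A, C, G, T ordering).
# The record does not describe the encoding used in the paper.
BASES = "ACGT"
CODON_INDEX = {"".join(c): i for i, c in enumerate(product(BASES, repeat=3))}
UNKNOWN = -1  # hypothetical code for codons containing ambiguous bases such as N


def split_codons(seq):
    """Split a nucleotide sequence into complete, in-frame codons."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def vectorize(seq):
    """Encode a sequence as a list of integer codon codes."""
    return [CODON_INDEX.get(codon, UNKNOWN) for codon in split_codons(seq)]


def changed_positions(reference, variant):
    """Return 1-based codon positions where two encoded sequences differ."""
    return [i + 1 for i, (r, v) in enumerate(zip(reference, variant)) if r != v]


# Toy sequences invented for illustration; not real SARS-CoV-2 data.
reference = vectorize("ATGTTTGTTTTT")
variant = vectorize("ATGTTTGATTTT")
print(changed_positions(reference, variant))
```

Fixed-length vectors of this kind could then be passed to a dimensionality reduction and clustering library (for example scikit-learn's `TSNE` and `DBSCAN` classes), although whether the authors used that library is not stated in this record.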
format Online
Article
Text
id pubmed-9113074
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9113074 2022-05-18 NGS data vectorization, clustering, and finding key codons in SARS-CoV-2 variations Kim, Juhyeon Cheon, Saeyeon Ahn, Insung BMC Bioinformatics Research BioMed Central 2022-05-17 /pmc/articles/PMC9113074/ /pubmed/35581558 http://dx.doi.org/10.1186/s12859-022-04718-7 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title NGS data vectorization, clustering, and finding key codons in SARS-CoV-2 variations
topic Research