UMAP-assisted K-means clustering of large-scale SARS-CoV-2 mutation datasets

Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a worldwide devastating effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating and preventing COVID-19. Due to the rapid gr...

Bibliographic Details
Main Authors: Hozumi, Yuta, Wang, Rui, Yin, Changchuan, Wei, Guo-Wei
Format: Online Article Text
Language: English
Published: Elsevier Ltd. 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7897976/
https://www.ncbi.nlm.nih.gov/pubmed/33647832
http://dx.doi.org/10.1016/j.compbiomed.2021.104264
author Hozumi, Yuta
Wang, Rui
Yin, Changchuan
Wei, Guo-Wei
collection PubMed
description Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a devastating worldwide effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating and preventing COVID-19. Due to the rapid growth in both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced K-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. The UMAP-assisted K-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
format Online
Article
Text
id pubmed-7897976
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Elsevier Ltd.
record_format MEDLINE/PubMed
spelling pubmed-7897976 2021-02-22 Comput Biol Med Article Elsevier Ltd. 2021-04 2021-02-22 /pmc/articles/PMC7897976/ /pubmed/33647832 http://dx.doi.org/10.1016/j.compbiomed.2021.104264 Text en © 2021 Elsevier Ltd. All rights reserved. Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company's public news and information website.
Elsevier hereby grants permission to make all its COVID-19-related research that is available on the COVID-19 resource centre - including this research content - immediately available in PubMed Central and other publicly funded repositories, such as the WHO COVID database with rights for unrestricted research re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for free by Elsevier for as long as the COVID-19 resource centre remains active.
title UMAP-assisted K-means clustering of large-scale SARS-CoV-2 mutation datasets
topic Article