covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets

Phylogenetic analysis has been widely used to describe, display, and infer the evolutionary patterns of viruses. The unprecedented accumulation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes has provided valuable materials for the real-time study of SARS-CoV-2 evolution. How...

Bibliographic Details
Main Authors: Cheng, Yexiao, Ji, Chengyang, Han, Na, Li, Jiaying, Xu, Lin, Chen, Ziyi, Yang, Rong, Zhou, Hang-Yu, Wu, Aiping
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9384632/
https://www.ncbi.nlm.nih.gov/pubmed/36533145
http://dx.doi.org/10.1093/ve/veac071
author Cheng, Yexiao
Ji, Chengyang
Han, Na
Li, Jiaying
Xu, Lin
Chen, Ziyi
Yang, Rong
Zhou, Hang-Yu
Wu, Aiping
author_facet Cheng, Yexiao
Ji, Chengyang
Han, Na
Li, Jiaying
Xu, Lin
Chen, Ziyi
Yang, Rong
Zhou, Hang-Yu
Wu, Aiping
author_sort Cheng, Yexiao
collection PubMed
description Phylogenetic analysis has been widely used to describe, display, and infer the evolutionary patterns of viruses. The unprecedented accumulation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes has provided valuable materials for the real-time study of SARS-CoV-2 evolution. However, the large number of SARS-CoV-2 genome sequences also poses great challenges for data analysis. Several methods for subsampling these large data sets have been introduced. However, current methods mainly focus on the spatiotemporal distribution of genomes without considering their genetic diversity, which might lead to post-subsampling bias. In this study, a subsampling method named covSampler was developed for the subsampling of SARS-CoV-2 genomes with consideration of both their spatiotemporal distribution and their genetic diversity. First, covSampler clusters all genomes according to their spatiotemporal distribution and genetic variation into groups that we call divergent pathways. Then, based on these divergent pathways, two kinds of subsampling strategies, representative subsampling and comprehensive subsampling, were provided with adjustable parameters to meet different users’ requirements. Our performance and validation tests indicate that covSampler is efficient and stable, with an abundance of options for user customization. Overall, our work has developed an easy-to-use tool and a webserver (https://www.covsampler.net) for the subsampling of SARS-CoV-2 genome sequences.
format Online
Article
Text
id pubmed-9384632
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9384632 2022-08-18 covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets Cheng, Yexiao Ji, Chengyang Han, Na Li, Jiaying Xu, Lin Chen, Ziyi Yang, Rong Zhou, Hang-Yu Wu, Aiping Virus Evol Research Article Phylogenetic analysis has been widely used to describe, display, and infer the evolutionary patterns of viruses. The unprecedented accumulation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes has provided valuable materials for the real-time study of SARS-CoV-2 evolution. However, the large number of SARS-CoV-2 genome sequences also poses great challenges for data analysis. Several methods for subsampling these large data sets have been introduced. However, current methods mainly focus on the spatiotemporal distribution of genomes without considering their genetic diversity, which might lead to post-subsampling bias. In this study, a subsampling method named covSampler was developed for the subsampling of SARS-CoV-2 genomes with consideration of both their spatiotemporal distribution and their genetic diversity. First, covSampler clusters all genomes according to their spatiotemporal distribution and genetic variation into groups that we call divergent pathways. Then, based on these divergent pathways, two kinds of subsampling strategies, representative subsampling and comprehensive subsampling, were provided with adjustable parameters to meet different users’ requirements. Our performance and validation tests indicate that covSampler is efficient and stable, with an abundance of options for user customization. Overall, our work has developed an easy-to-use tool and a webserver (https://www.covsampler.net) for the subsampling of SARS-CoV-2 genome sequences. Oxford University Press 2022-08-05 /pmc/articles/PMC9384632/ /pubmed/36533145 http://dx.doi.org/10.1093/ve/veac071 Text en © The Author(s) 2022. Published by Oxford University Press.
https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Research Article
Cheng, Yexiao
Ji, Chengyang
Han, Na
Li, Jiaying
Xu, Lin
Chen, Ziyi
Yang, Rong
Zhou, Hang-Yu
Wu, Aiping
covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets
title covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets
title_full covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets
title_fullStr covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets
title_full_unstemmed covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets
title_short covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets
title_sort covsampler: a subsampling method with balanced genetic diversity for large-scale sars-cov-2 genome data sets
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9384632/
https://www.ncbi.nlm.nih.gov/pubmed/36533145
http://dx.doi.org/10.1093/ve/veac071
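The two-step idea summarized in the abstract (group genomes by spatiotemporal distribution and genetic variation, then subsample from those groups) can be illustrated with a toy sketch. This is not covSampler's algorithm or interface: the record fields (id, region, date, mutations), the exact-match grouping key used as a stand-in for divergent pathways, and the two function bodies are all assumptions made for illustration. In particular, reading "representative" as proportional allocation and "comprehensive" as covering as many groups as possible is one plausible interpretation of the names, not something the record states.

```python
import random
from collections import defaultdict


def group_genomes(genomes):
    """Toy stand-in for clustering (assumption): bucket genomes by
    (region, year-month, exact mutation set)."""
    groups = defaultdict(list)
    for g in genomes:
        key = (g["region"], g["date"][:7], frozenset(g["mutations"]))
        groups[key].append(g["id"])
    return dict(groups)


def representative_subsample(groups, size, rng):
    """Allocate the sample across groups in proportion to group size
    (largest-remainder rounding), then draw at random within each group."""
    total = sum(len(ids) for ids in groups.values())
    if total == 0:
        return []
    size = min(size, total)
    quotas = {k: len(ids) * size / total for k, ids in groups.items()}
    counts = {k: int(q) for k, q in quotas.items()}
    leftover = size - sum(counts.values())
    # Hand the remaining slots to the groups with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    picked = []
    for k, ids in groups.items():
        picked.extend(rng.sample(ids, counts[k]))
    return picked


def comprehensive_subsample(groups, size, rng):
    """Visit groups round-robin so that as many groups as possible
    contribute at least one genome before any group contributes a second."""
    pools = [rng.sample(ids, len(ids)) for ids in groups.values()]  # shuffled copies
    picked = []
    while len(picked) < size and any(pools):
        for pool in pools:
            if pool and len(picked) < size:
                picked.append(pool.pop())
    return picked
```

With a large group and several small ones, the proportional variant mostly returns genomes from the large group, while the round-robin variant spends its first picks on one genome per group. That contrast is the trade-off the two strategy names suggest.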