covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets
Phylogenetic analysis has been widely used to describe, display, and infer the evolutionary patterns of viruses. The unprecedented accumulation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes has provided valuable materials for the real-time study of SARS-CoV-2 evolution. How...
Main Authors: | Cheng, Yexiao; Ji, Chengyang; Han, Na; Li, Jiaying; Xu, Lin; Chen, Ziyi; Yang, Rong; Zhou, Hang-Yu; Wu, Aiping |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9384632/ https://www.ncbi.nlm.nih.gov/pubmed/36533145 http://dx.doi.org/10.1093/ve/veac071 |
_version_ | 1784769455061991424 |
---|---|
author | Cheng, Yexiao Ji, Chengyang Han, Na Li, Jiaying Xu, Lin Chen, Ziyi Yang, Rong Zhou, Hang-Yu Wu, Aiping |
author_facet | Cheng, Yexiao Ji, Chengyang Han, Na Li, Jiaying Xu, Lin Chen, Ziyi Yang, Rong Zhou, Hang-Yu Wu, Aiping |
author_sort | Cheng, Yexiao |
collection | PubMed |
description | Phylogenetic analysis has been widely used to describe, display, and infer the evolutionary patterns of viruses. The unprecedented accumulation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes has provided valuable materials for the real-time study of SARS-CoV-2 evolution. However, the large number of SARS-CoV-2 genome sequences also poses great challenges for data analysis. Several methods for subsampling these large data sets have been introduced. However, current methods mainly focus on the spatiotemporal distribution of genomes without considering their genetic diversity, which might lead to post-subsampling bias. In this study, a subsampling method named covSampler was developed for the subsampling of SARS-CoV-2 genomes with consideration of both their spatiotemporal distribution and their genetic diversity. First, covSampler clusters all genomes according to their spatiotemporal distribution and genetic variation into groups that we call divergent pathways. Then, based on these divergent pathways, two kinds of subsampling strategies, representative subsampling and comprehensive subsampling, were provided with adjustable parameters to meet different users’ requirements. Our performance and validation tests indicate that covSampler is efficient and stable, with an abundance of options for user customization. Overall, our work has developed an easy-to-use tool and a webserver (https://www.covsampler.net) for the subsampling of SARS-CoV-2 genome sequences. |
format | Online Article Text |
id | pubmed-9384632 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9384632 2022-08-18 covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets Cheng, Yexiao Ji, Chengyang Han, Na Li, Jiaying Xu, Lin Chen, Ziyi Yang, Rong Zhou, Hang-Yu Wu, Aiping Virus Evol Research Article Phylogenetic analysis has been widely used to describe, display, and infer the evolutionary patterns of viruses. The unprecedented accumulation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes has provided valuable materials for the real-time study of SARS-CoV-2 evolution. However, the large number of SARS-CoV-2 genome sequences also poses great challenges for data analysis. Several methods for subsampling these large data sets have been introduced. However, current methods mainly focus on the spatiotemporal distribution of genomes without considering their genetic diversity, which might lead to post-subsampling bias. In this study, a subsampling method named covSampler was developed for the subsampling of SARS-CoV-2 genomes with consideration of both their spatiotemporal distribution and their genetic diversity. First, covSampler clusters all genomes according to their spatiotemporal distribution and genetic variation into groups that we call divergent pathways. Then, based on these divergent pathways, two kinds of subsampling strategies, representative subsampling and comprehensive subsampling, were provided with adjustable parameters to meet different users’ requirements. Our performance and validation tests indicate that covSampler is efficient and stable, with an abundance of options for user customization. Overall, our work has developed an easy-to-use tool and a webserver (https://www.covsampler.net) for the subsampling of SARS-CoV-2 genome sequences. Oxford University Press 2022-08-05 /pmc/articles/PMC9384632/ /pubmed/36533145 http://dx.doi.org/10.1093/ve/veac071 Text en © The Author(s) 2022. Published by Oxford University Press.
https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Research Article Cheng, Yexiao Ji, Chengyang Han, Na Li, Jiaying Xu, Lin Chen, Ziyi Yang, Rong Zhou, Hang-Yu Wu, Aiping covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets |
title | covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets |
title_full | covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets |
title_fullStr | covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets |
title_full_unstemmed | covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets |
title_short | covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets |
title_sort | covsampler: a subsampling method with balanced genetic diversity for large-scale sars-cov-2 genome data sets |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9384632/ https://www.ncbi.nlm.nih.gov/pubmed/36533145 http://dx.doi.org/10.1093/ve/veac071 |
work_keys_str_mv | AT chengyexiao covsamplerasubsamplingmethodwithbalancedgeneticdiversityforlargescalesarscov2genomedatasets AT jichengyang covsamplerasubsamplingmethodwithbalancedgeneticdiversityforlargescalesarscov2genomedatasets AT hanna covsamplerasubsamplingmethodwithbalancedgeneticdiversityforlargescalesarscov2genomedatasets AT lijiaying covsamplerasubsamplingmethodwithbalancedgeneticdiversityforlargescalesarscov2genomedatasets AT xulin covsamplerasubsamplingmethodwithbalancedgeneticdiversityforlargescalesarscov2genomedatasets AT chenziyi covsamplerasubsamplingmethodwithbalancedgeneticdiversityforlargescalesarscov2genomedatasets AT yangrong covsamplerasubsamplingmethodwithbalancedgeneticdiversityforlargescalesarscov2genomedatasets AT zhouhangyu covsamplerasubsamplingmethodwithbalancedgeneticdiversityforlargescalesarscov2genomedatasets AT wuaiping covsamplerasubsamplingmethodwithbalancedgeneticdiversityforlargescalesarscov2genomedatasets |