Cargando…
Towards Efficient and Accurate SARS-CoV-2 Genome Sequence Typing Based on Supervised Learning Approaches
Despite the active development of SARS-CoV-2 surveillance methods (e.g., Nextstrain, GISAID, Pangolin), the global emergence of various SARS-CoV-2 viral lineages that potentially cause antiviral and vaccine failure has driven the need for accurate and efficient SARS-CoV-2 genome sequence classifiers...
Autores principales: | , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
MDPI
2022
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9505117/ https://www.ncbi.nlm.nih.gov/pubmed/36144387 http://dx.doi.org/10.3390/microorganisms10091785 |
_version_ | 1784796392407957504 |
---|---|
author | Miao, Miao De Clercq, Erik Li, Guangdi |
author_facet | Miao, Miao De Clercq, Erik Li, Guangdi |
author_sort | Miao, Miao |
collection | PubMed |
description | Despite the active development of SARS-CoV-2 surveillance methods (e.g., Nextstrain, GISAID, Pangolin), the global emergence of various SARS-CoV-2 viral lineages that potentially cause antiviral and vaccine failure has driven the need for accurate and efficient SARS-CoV-2 genome sequence classifiers. This study presents an optimized method that accurately identifies the viral lineages of SARS-CoV-2 genome sequences using existing schemes. For Nextstrain and GISAID clades, a template matching-based method is proposed to quantify the differences between viral clades and to play an important role in classification evaluation. Furthermore, to improve the typing accuracy of SARS-CoV-2 genome sequences, an ensemble model that integrates a combination of machine learning-based methods (such as Random Forest and Catboost) with optimized weights is proposed for Nextstrain, Pangolin, and GISAID clades. Cross-validation is applied to optimize the parameters of the machine learning-based method and the weight settings of the ensemble model. To improve the efficiency of the model, in addition to the one-hot encoding method, we have proposed a nucleotide site mutation-based data structure that requires less computational resources and performs better in SARS-CoV-2 genome sequence typing. Based on an accumulated database of >1 million SARS-CoV-2 genome sequences, performance evaluations show that the proposed system has a typing accuracy of 99.879%, 97.732%, and 96.291% for Nextstrain, Pangolin, and GISAID clades, respectively. A single prediction only takes an average of <20 ms on a portable laptop. Overall, this study provides an efficient and accurate SARS-CoV-2 genome sequence typing system that benefits current and future surveillance of SARS-CoV-2 variants. |
format | Online Article Text |
id | pubmed-9505117 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-95051172022-09-24 Towards Efficient and Accurate SARS-CoV-2 Genome Sequence Typing Based on Supervised Learning Approaches Miao, Miao De Clercq, Erik Li, Guangdi Microorganisms Article Despite the active development of SARS-CoV-2 surveillance methods (e.g., Nextstrain, GISAID, Pangolin), the global emergence of various SARS-CoV-2 viral lineages that potentially cause antiviral and vaccine failure has driven the need for accurate and efficient SARS-CoV-2 genome sequence classifiers. This study presents an optimized method that accurately identifies the viral lineages of SARS-CoV-2 genome sequences using existing schemes. For Nextstrain and GISAID clades, a template matching-based method is proposed to quantify the differences between viral clades and to play an important role in classification evaluation. Furthermore, to improve the typing accuracy of SARS-CoV-2 genome sequences, an ensemble model that integrates a combination of machine learning-based methods (such as Random Forest and Catboost) with optimized weights is proposed for Nextstrain, Pangolin, and GISAID clades. Cross-validation is applied to optimize the parameters of the machine learning-based method and the weight settings of the ensemble model. To improve the efficiency of the model, in addition to the one-hot encoding method, we have proposed a nucleotide site mutation-based data structure that requires less computational resources and performs better in SARS-CoV-2 genome sequence typing. Based on an accumulated database of >1 million SARS-CoV-2 genome sequences, performance evaluations show that the proposed system has a typing accuracy of 99.879%, 97.732%, and 96.291% for Nextstrain, Pangolin, and GISAID clades, respectively. A single prediction only takes an average of <20 ms on a portable laptop. Overall, this study provides an efficient and accurate SARS-CoV-2 genome sequence typing system that benefits current and future surveillance of SARS-CoV-2 variants. MDPI 2022-09-04 /pmc/articles/PMC9505117/ /pubmed/36144387 http://dx.doi.org/10.3390/microorganisms10091785 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Miao, Miao De Clercq, Erik Li, Guangdi Towards Efficient and Accurate SARS-CoV-2 Genome Sequence Typing Based on Supervised Learning Approaches |
title | Towards Efficient and Accurate SARS-CoV-2 Genome Sequence Typing Based on Supervised Learning Approaches |
title_full | Towards Efficient and Accurate SARS-CoV-2 Genome Sequence Typing Based on Supervised Learning Approaches |
title_fullStr | Towards Efficient and Accurate SARS-CoV-2 Genome Sequence Typing Based on Supervised Learning Approaches |
title_full_unstemmed | Towards Efficient and Accurate SARS-CoV-2 Genome Sequence Typing Based on Supervised Learning Approaches |
title_short | Towards Efficient and Accurate SARS-CoV-2 Genome Sequence Typing Based on Supervised Learning Approaches |
title_sort | towards efficient and accurate sars-cov-2 genome sequence typing based on supervised learning approaches |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9505117/ https://www.ncbi.nlm.nih.gov/pubmed/36144387 http://dx.doi.org/10.3390/microorganisms10091785 |
work_keys_str_mv | AT miaomiao towardsefficientandaccuratesarscov2genomesequencetypingbasedonsupervisedlearningapproaches AT declercqerik towardsefficientandaccuratesarscov2genomesequencetypingbasedonsupervisedlearningapproaches AT liguangdi towardsefficientandaccuratesarscov2genomesequencetypingbasedonsupervisedlearningapproaches |