Benchmarking machine learning robustness in Covid-19 genome sequence classification

The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome—millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics,...

Bibliographic Details
Main Authors:	Ali, Sarwan; Sahoo, Bikram; Zelikovsky, Alexander; Chen, Pin-Yu; Patterson, Murray
Format:	Online Article Text
Language:	English
Published:	Nature Publishing Group UK 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10010240/ https://www.ncbi.nlm.nih.gov/pubmed/36914815 http://dx.doi.org/10.1038/s41598-023-31368-3

_version_	1784906152264335360
author	Ali, Sarwan; Sahoo, Bikram; Zelikovsky, Alexander; Chen, Pin-Yu; Patterson, Murray
author_facet	Ali, Sarwan; Sahoo, Bikram; Zelikovsky, Alexander; Chen, Pin-Yu; Patterson, Murray
author_sort	Ali, Sarwan
collection	PubMed
description	The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome (millions of sequences and counting). This amount of data, while orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches with different perturbation budgets are more robust (and accurate) than others for specific embedding methods to certain noise simulations on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
format	Online Article Text
id	pubmed-10010240
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Nature Publishing Group UK
record_format	MEDLINE/PubMed
spelling	pubmed-10010240 2023-03-14 Benchmarking machine learning robustness in Covid-19 genome sequence classification Ali, Sarwan; Sahoo, Bikram; Zelikovsky, Alexander; Chen, Pin-Yu; Patterson, Murray Sci Rep Article Nature Publishing Group UK 2023-03-13 /pmc/articles/PMC10010240/ /pubmed/36914815 http://dx.doi.org/10.1038/s41598-023-31368-3 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle	Article Ali, Sarwan Sahoo, Bikram Zelikovsky, Alexander Chen, Pin-Yu Patterson, Murray Benchmarking machine learning robustness in Covid-19 genome sequence classification
title	Benchmarking machine learning robustness in Covid-19 genome sequence classification
title_full	Benchmarking machine learning robustness in Covid-19 genome sequence classification
title_fullStr	Benchmarking machine learning robustness in Covid-19 genome sequence classification
title_full_unstemmed	Benchmarking machine learning robustness in Covid-19 genome sequence classification
title_short	Benchmarking machine learning robustness in Covid-19 genome sequence classification
title_sort	benchmarking machine learning robustness in covid-19 genome sequence classification
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10010240/ https://www.ncbi.nlm.nih.gov/pubmed/36914815 http://dx.doi.org/10.1038/s41598-023-31368-3
work_keys_str_mv	AT alisarwan benchmarkingmachinelearningrobustnessincovid19genomesequenceclassification AT sahoobikram benchmarkingmachinelearningrobustnessincovid19genomesequenceclassification AT zelikovskyalexander benchmarkingmachinelearningrobustnessincovid19genomesequenceclassification AT chenpinyu benchmarkingmachinelearningrobustnessincovid19genomesequenceclassification AT pattersonmurray benchmarkingmachinelearningrobustnessincovid19genomesequenceclassification

Similar Items