Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors

The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been...

Bibliographic Details
Main authors:	Sahoo, Bikram; Ali, Sarwan; Chen, Pin-Yu; Patterson, Murray; Zelikovsky, Alexander
Format:	Online Article Text
Language:	English
Published:	MDPI 2023
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10296223/ https://www.ncbi.nlm.nih.gov/pubmed/37371514 http://dx.doi.org/10.3390/biom13060934

author	Sahoo, Bikram; Ali, Sarwan; Chen, Pin-Yu; Patterson, Murray; Zelikovsky, Alexander
collection	PubMed
description	The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences.
format	Online Article Text
id	pubmed-10296223
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-10296223 2023-06-28. Biomolecules. Article. MDPI, 2023-06-02. /pmc/articles/PMC10296223/ /pubmed/37371514 http://dx.doi.org/10.3390/biom13060934 Text en. © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors
topic	Article
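
The description above mentions fixed-length vectors built from k-mers and error-incorporated sequences generated by introducing random errors. This record does not give the authors' implementation, so the following is only a minimal sketch under stated assumptions: it uses plain k-mer counting (not the spaced or weighted variants named in the abstract) and substitution-only random errors, and the function names `kmer_spectrum` and `add_random_substitutions` are hypothetical.

```python
import random
from itertools import product

ALPHABET = "ACGT"


def kmer_spectrum(seq, k=3):
    """Count every k-mer over ACGT; the vector always has 4**k entries."""
    # Fixed ordering of all possible k-mers, so every sequence maps to the
    # same vector layout regardless of its length.
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # skip k-mers with ambiguous bases such as N
            vec[pos] += 1
    return vec


def add_random_substitutions(seq, rate, seed=0):
    """Replace each A/C/G/T base with a different base with probability `rate`.

    Assumption: substitutions only; real long-read error profiles also
    include insertions and deletions, which this sketch does not model.
    """
    rng = random.Random(seed)
    out = []
    for base in seq:
        if base in ALPHABET and rng.random() < rate:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)
```

Because the vector length depends only on k, a clean sequence and its error-incorporated copy produce vectors of the same size that can be passed to the same classifier.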
