Cargando…
Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach
The world is grappling with the COVID-19 pandemic caused by the 2019 novel SARS-CoV-2. To better understand this novel virus and its relationship with other pathogens, new methods for analyzing the genome are required. In this study, intrinsic dinucleotide genomic signatures were analyzed for whole...
Formato: | Online Artículo Texto |
---|---|
Lenguaje: | English |
Publicado: |
IEEE
2020
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8675546/ https://www.ncbi.nlm.nih.gov/pubmed/34976561 http://dx.doi.org/10.1109/ACCESS.2020.3031387 |
_version_ | 1784615891329089536 |
---|---|
collection | PubMed |
description | The world is grappling with the COVID-19 pandemic caused by the 2019 novel SARS-CoV-2. To better understand this novel virus and its relationship with other pathogens, new methods for analyzing the genome are required. In this study, intrinsic dinucleotide genomic signatures were analyzed for whole genome sequence data of eight pathogenic species, including SARS-CoV-2. The genome sequences were transformed into dinucleotide relative frequencies and classified using the extreme gradient boosting (XGBoost) model. The classification models were trained to a) distinguish between the sequences of all eight species and b) distinguish between sequences of SARS-CoV-2 that originate from different geographic regions. Our method attained 100% in all performance metrics and for all tasks in the eight-species classification problem. Moreover, the models achieved 67% balanced accuracy for the task of classifying the SARS-CoV-2 sequences into the six continental regions and achieved 86% balanced accuracy for the task of classifying SARS-CoV-2 samples as either originating from Asia or not. Analysis of the dinucleotide genomic profiles of the eight species revealed a similarity between the SARS-CoV-2 and MERS-CoV viral sequences. Further analysis of SARS-CoV-2 viral sequences from the six continents revealed that samples from Oceania had the highest frequency of TT dinucleotides as well as the lowest CG frequency compared to the other continents. The dinucleotide signatures of AC, AG,CA, CT, GA, GT, TC, and TG were well conserved across most genomes, while the frequencies of other dinucleotide signatures varied considerably. Altogether, the results from this study demonstrate the utility of dinucleotide relative frequencies for discriminating and identifying similar species. |
format | Online Article Text |
id | pubmed-8675546 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | IEEE |
record_format | MEDLINE/PubMed |
spelling | pubmed-86755462021-12-29 Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach IEEE Access Computational and Artificial Intelligence The world is grappling with the COVID-19 pandemic caused by the 2019 novel SARS-CoV-2. To better understand this novel virus and its relationship with other pathogens, new methods for analyzing the genome are required. In this study, intrinsic dinucleotide genomic signatures were analyzed for whole genome sequence data of eight pathogenic species, including SARS-CoV-2. The genome sequences were transformed into dinucleotide relative frequencies and classified using the extreme gradient boosting (XGBoost) model. The classification models were trained to a) distinguish between the sequences of all eight species and b) distinguish between sequences of SARS-CoV-2 that originate from different geographic regions. Our method attained 100% in all performance metrics and for all tasks in the eight-species classification problem. Moreover, the models achieved 67% balanced accuracy for the task of classifying the SARS-CoV-2 sequences into the six continental regions and achieved 86% balanced accuracy for the task of classifying SARS-CoV-2 samples as either originating from Asia or not. Analysis of the dinucleotide genomic profiles of the eight species revealed a similarity between the SARS-CoV-2 and MERS-CoV viral sequences. Further analysis of SARS-CoV-2 viral sequences from the six continents revealed that samples from Oceania had the highest frequency of TT dinucleotides as well as the lowest CG frequency compared to the other continents. The dinucleotide signatures of AC, AG,CA, CT, GA, GT, TC, and TG were well conserved across most genomes, while the frequencies of other dinucleotide signatures varied considerably. Altogether, the results from this study demonstrate the utility of dinucleotide relative frequencies for discriminating and identifying similar species. IEEE 2020-10-15 /pmc/articles/PMC8675546/ /pubmed/34976561 http://dx.doi.org/10.1109/ACCESS.2020.3031387 Text en This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ https://creativecommons.org/licenses/by-nc-nd/4.0/This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ |
spellingShingle | Computational and Artificial Intelligence Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach |
title | Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach |
title_full | Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach |
title_fullStr | Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach |
title_full_unstemmed | Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach |
title_short | Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach |
title_sort | classification of covid-19 and other pathogenic sequences: a dinucleotide frequency and machine learning approach |
topic | Computational and Artificial Intelligence |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8675546/ https://www.ncbi.nlm.nih.gov/pubmed/34976561 http://dx.doi.org/10.1109/ACCESS.2020.3031387 |
work_keys_str_mv | AT classificationofcovid19andotherpathogenicsequencesadinucleotidefrequencyandmachinelearningapproach AT classificationofcovid19andotherpathogenicsequencesadinucleotidefrequencyandmachinelearningapproach AT classificationofcovid19andotherpathogenicsequencesadinucleotidefrequencyandmachinelearningapproach AT classificationofcovid19andotherpathogenicsequencesadinucleotidefrequencyandmachinelearningapproach AT classificationofcovid19andotherpathogenicsequencesadinucleotidefrequencyandmachinelearningapproach AT classificationofcovid19andotherpathogenicsequencesadinucleotidefrequencyandmachinelearningapproach AT classificationofcovid19andotherpathogenicsequencesadinucleotidefrequencyandmachinelearningapproach |