
Computational predictions for protein sequences of COVID-19 virus via machine learning algorithms

The rapid spread of coronavirus disease (COVID-19) has become a worldwide pandemic and affected more than 15 million patients reported in 27 countries. Therefore, the computational biology carrying this virus that correlates with the human population urgently needs to be understood. In this paper, t...


Bibliographic Details
Main Authors:	Afify, Heba M., Zanaty, Muhammad S.
Format:	Online Article Text
Language:	English
Published:	Springer Berlin Heidelberg 2021
Subjects:	Original Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8295007/ https://www.ncbi.nlm.nih.gov/pubmed/34291385 http://dx.doi.org/10.1007/s11517-021-02412-z

_version_	1783725351738802176
author	Afify, Heba M. Zanaty, Muhammad S.
author_facet	Afify, Heba M. Zanaty, Muhammad S.
author_sort	Afify, Heba M.
collection	PubMed
description	The rapid spread of coronavirus disease (COVID-19) has become a worldwide pandemic and affected more than 15 million patients reported in 27 countries. Therefore, the computational biology carrying this virus that correlates with the human population urgently needs to be understood. In this paper, the classification of the human protein sequences of COVID-19, according to the country, is presented based on machine learning algorithms. The proposed model is based on distinguishing 9238 sequences using three stages, including data preprocessing, data labeling, and classification. In the first stage, data preprocessing’s function converts the amino acids of COVID-19 protein sequences into eight groups of numbers based on the amino acids’ volume and dipole. It is based on the conjoint triad (CT) method. In the second stage, there are two methods for labeling data from 27 countries from 0 to 26. The first method is based on selecting one number for each country according to the code numbers of countries, while the second method is based on binary elements for each country. According to their countries, machine learning algorithms are used to discover different COVID-19 protein sequences in the last stage. The obtained results demonstrate 100% accuracy, 100% sensitivity, and 90% specificity via the country-based binary labeling method with a linear support vector machine (SVM) classifier. Furthermore, with significant infection data, the USA is more prone to correct classification compared to other countries with fewer data. The unbalanced data for COVID-19 protein sequences is considered a major issue, especially as the US’s available data represents 76% of a total of 9238 sequences. The proposed model will act as a prediction tool for the COVID-19 protein sequences in different countries. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s11517-021-02412-z.
format	Online Article Text
id	pubmed-8295007
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Springer Berlin Heidelberg
record_format	MEDLINE/PubMed
spelling	pubmed-8295007 2021-07-22 Computational predictions for protein sequences of COVID-19 virus via machine learning algorithms Afify, Heba M. Zanaty, Muhammad S. Med Biol Eng Comput Original Article Springer Berlin Heidelberg 2021-07-22 2021 /pmc/articles/PMC8295007/ /pubmed/34291385 http://dx.doi.org/10.1007/s11517-021-02412-z Text en © International Federation for Medical and Biological Engineering 2021 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle	Original Article Afify, Heba M. Zanaty, Muhammad S. Computational predictions for protein sequences of COVID-19 virus via machine learning algorithms
title	Computational predictions for protein sequences of COVID-19 virus via machine learning algorithms
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8295007/ https://www.ncbi.nlm.nih.gov/pubmed/34291385 http://dx.doi.org/10.1007/s11517-021-02412-z
work_keys_str_mv	AT afifyhebam computationalpredictionsforproteinsequencesofcovid19virusviamachinelearningalgorithms AT zanatymuhammads computationalpredictionsforproteinsequencesofcovid19virusviamachinelearningalgorithms
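
The description field above outlines a three-stage pipeline: conjoint triad (CT) encoding, country labeling, and classification. The following is a minimal sketch of the first two stages, not the authors' implementation. It assumes the widely cited seven-class CT grouping by side-chain dipole and volume, because the record mentions eight groups without listing them. It also reads "binary elements for each country" as a one-vs-rest indicator, which is only one possible interpretation. All function names are hypothetical.

```python
from itertools import product

# ASSUMPTION: the commonly used seven-class conjoint triad grouping.
# The record mentions eight groups but does not list their members,
# so this table is a stand-in, not the authors' grouping.
CT_GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_GROUP = {
    aa: group
    for group, members in enumerate(CT_GROUPS, start=1)
    for aa in members
}


def to_groups(sequence):
    """Map each residue to its group number; unknown residues map to 0."""
    return [RESIDUE_TO_GROUP.get(aa, 0) for aa in sequence.upper()]


def triad_frequencies(sequence):
    """Count each window of three consecutive group numbers.

    Windows containing an unknown residue (group 0) are skipped.
    The result is a fixed-length vector with one entry per possible triad.
    """
    n = len(CT_GROUPS)
    counts = {triad: 0 for triad in product(range(1, n + 1), repeat=3)}
    groups = to_groups(sequence)
    for triad in zip(groups, groups[1:], groups[2:]):
        if 0 not in triad:
            counts[triad] += 1
    return [counts[triad] for triad in sorted(counts)]


def integer_labels(countries):
    """First labeling scheme: one code number per country, 0..K-1."""
    codes = {c: i for i, c in enumerate(sorted(set(countries)))}
    return [codes[c] for c in countries], codes


def binary_labels(countries, target):
    """Second scheme (ASSUMED one-vs-rest): 1 for the target country, else 0."""
    return [1 if c == target else 0 for c in countries]
```

The vectors returned by `triad_frequencies` and either label list could then be passed to any linear SVM implementation, for example `sklearn.svm.LinearSVC`. Given the class imbalance noted in the description (76% of sequences from the US), accuracy alone would be a weak measure of such a classifier.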