
Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus



Bibliographic Details
Main Authors: Qiang, Xiao-Li, Xu, Peng, Fang, Gang, Liu, Wen-Bin, Kou, Zheng
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7093988/
https://www.ncbi.nlm.nih.gov/pubmed/32209118
http://dx.doi.org/10.1186/s40249-020-00649-8
_version_ 1783510384303407104
author Qiang, Xiao-Li
Xu, Peng
Fang, Gang
Liu, Wen-Bin
Kou, Zheng
author_facet Qiang, Xiao-Li
Xu, Peng
Fang, Gang
Liu, Wen-Bin
Kou, Zheng
author_sort Qiang, Xiao-Li
collection PubMed
description BACKGROUND: Coronaviruses can cross the species barrier and infect humans, causing a severe respiratory syndrome. SARS-CoV-2, which potentially originated in bats, is still circulating in China. In this study, a prediction model is proposed to evaluate the infection risk of non-human-origin coronaviruses for early warning. METHODS: The spike protein sequences of 2666 coronaviruses were collected from the 2019 Novel Coronavirus Resource (2019nCoVR) database of the China National Genomics Data Center on Jan 29, 2020. A total of 507 human-origin viruses were regarded as positive samples, whereas 2159 non-human-origin viruses were regarded as negative samples. To capture the key information of the spike protein, three feature encoding algorithms (amino acid composition, AAC; parallel correlation-based pseudo-amino-acid composition, PC-PseAAC; and G-gap dipeptide composition, GGAP) were used to train 41 random forest models. The optimal feature with the best performance was identified by the multidimensional scaling method, which was used to explore the pattern of human coronaviruses. RESULTS: The 10-fold cross-validation results showed that good performance was achieved with the GGAP (g = 3) feature. The predictive model achieved a maximum accuracy (ACC) of 98.18% together with a Matthews correlation coefficient (MCC) of 0.9638. Seven clusters of human coronaviruses (229E, NL63, OC43, HKU1, MERS-CoV, SARS-CoV, and SARS-CoV-2) were found. The cluster for SARS-CoV-2 was very close to that for SARS-CoV, which suggests that both viruses have the same human receptor (angiotensin converting enzyme II). The large gap in the distance curve suggests that the origin of SARS-CoV-2 is not clear and that continued surveillance in the field is needed. The smooth distance curve for SARS-CoV suggests that its close relatives still exist in nature and continue to pose a public health challenge.
CONCLUSIONS: The optimal feature (GGAP, g = 3) performed well in predicting infection risk and could be used to explore the evolutionary dynamics in a simple, fast and large-scale manner. The study may be beneficial for field surveillance of coronavirus genome mutations.
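The abstract names the G-gap dipeptide composition (GGAP) feature and the ACC and MCC metrics without defining them. The following is a minimal Python sketch, not the authors' implementation. It assumes the commonly used definition of GGAP: the relative frequency of each ordered residue pair separated by exactly g intervening positions, over the 20 standard amino acids, giving a 400-dimensional vector. The function names (ggap_composition, accuracy, mcc) are hypothetical, and the record does not state how the authors handled non-standard residues; here such pairs are simply skipped.

```python
from itertools import product
from math import sqrt

# The 20 standard amino acids (assumption: non-standard symbols are skipped).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def ggap_composition(seq, g=3):
    """Return the 400-dimensional g-gap dipeptide frequency vector of seq.

    A g-gap dipeptide pairs the residue at position i with the residue
    at position i + g + 1, i.e. with g residues in between.
    """
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # ignore pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    # Normalise counts to frequencies; an all-zero vector if no pair was found.
    return [counts[p] / total if total else 0.0 for p in PAIRS]


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples (ACC)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy sequence, not real spike protein data.
    vec = ggap_composition("ACDEFG", g=3)
    print(len(vec), vec[PAIRS.index("AF")], vec[PAIRS.index("CG")])
```

Under these assumptions, each spike protein sequence would be mapped to one such vector, and the vectors would then be passed to a classifier such as a random forest and scored with accuracy and mcc on held-out folds.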
format Online
Article
Text
id pubmed-7093988
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7093988 2020-03-27 Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus Qiang, Xiao-Li Xu, Peng Fang, Gang Liu, Wen-Bin Kou, Zheng Infect Dis Poverty Research Article BioMed Central 2020-03-25 /pmc/articles/PMC7093988/ /pubmed/32209118 http://dx.doi.org/10.1186/s40249-020-00649-8 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7093988/
https://www.ncbi.nlm.nih.gov/pubmed/32209118
http://dx.doi.org/10.1186/s40249-020-00649-8