
Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning

Epidemiological studies show that COVID-19 variants-of-concern, like Delta and Omicron, pose different risks for severe disease, but they typically lack sequence-level information for the virus. Studies which do obtain viral genome sequences are generally limited in time, location, and population sc...

Bibliographic Details
Main Authors:	Sokhansanj, Bahrad A., Rosen, Gail L.
Format:	Online Article Text
Language:	English
Published:	The Author(s). Published by Elsevier Ltd. 2022
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9384346/ https://www.ncbi.nlm.nih.gov/pubmed/36041271 http://dx.doi.org/10.1016/j.compbiomed.2022.105969

collection	PubMed
description	Epidemiological studies show that COVID-19 variants-of-concern, like Delta and Omicron, pose different risks for severe disease, but they typically lack sequence-level information for the virus. Studies which do obtain viral genome sequences are generally limited in time, location, and population scope. Retrospective meta-analyses require time-consuming data extraction from heterogeneous formats and are limited to publicly available reports. Fortuitously, a subset of GISAID, the global SARS-CoV-2 sequence repository, includes “patient status” metadata that can indicate whether a sequence record is associated with mild or severe disease. While GISAID lacks data on comorbidities relevant to severity, such as obesity and chronic disease, it does include metadata for age and sex to use as additional attributes in modeling. With these caveats, previous efforts have demonstrated that genotype-patient status models can be fit to GISAID data, particularly when country-of-origin is used as an additional feature. But are these models robust and biologically meaningful? This paper shows that, in fact, temporal and geographic biases in sequences submitted to GISAID, as well as the evolving pandemic response, particularly reduction in severe disease due to vaccination, create complex issues for model development and interpretation. This paper poses a potential solution: efficient mixed effects machine learning using GPBoost, treating country as a random effect group. Training and validation using temporally split GISAID data and emerging Omicron variants demonstrates that GPBoost models are more predictive of the impact of spike protein mutations on patient outcomes than fixed effect XGBoost, LightGBM, random forests, and elastic net logistic regression models.
format	Online Article Text
id	pubmed-9384346
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	The Author(s). Published by Elsevier Ltd.
record_format	MEDLINE/PubMed
spelling	pubmed-9384346 2022-08-17 Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning Sokhansanj, Bahrad A. Rosen, Gail L. Comput Biol Med Article The Author(s). Published by Elsevier Ltd. 2022-10 2022-08-17 /pmc/articles/PMC9384346/ /pubmed/36041271 http://dx.doi.org/10.1016/j.compbiomed.2022.105969 Text en © 2022 The Author(s) Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company's public news and information website. Elsevier hereby grants permission to make all its COVID-19-related research that is available on the COVID-19 resource centre - including this research content - immediately available in PubMed Central and other publicly funded repositories, such as the WHO COVID database with rights for unrestricted research re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for free by Elsevier for as long as the COVID-19 resource centre remains active.
title	Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning
topic	Article