
Extracting relevant predictive variables for COVID-19 severity prognosis: An exhaustive comparison of feature selection techniques

With the COVID-19 pandemic having caused unprecedented numbers of infections and deaths, large research efforts have been undertaken to increase our understanding of the disease and the factors which determine diverse clinical evolutions. Here we focused on a fully data-driven exploration regarding...


Bibliographic Details
Main authors: Hayet-Otero, Miren, García-García, Fernando, Lee, Dae-Jin, Martínez-Minaya, Joaquín, España Yandiola, Pedro Pablo, Urrutia Landa, Isabel, Nieves Ermecheo, Mónica, Quintana, José María, Menéndez, Rosario, Torres, Antoni, Zalacain Jorge, Rafael, Arostegui, Inmaculada
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10101453/
https://www.ncbi.nlm.nih.gov/pubmed/37053151
http://dx.doi.org/10.1371/journal.pone.0284150
_version_ 1785025519818899456
author Hayet-Otero, Miren
García-García, Fernando
Lee, Dae-Jin
Martínez-Minaya, Joaquín
España Yandiola, Pedro Pablo
Urrutia Landa, Isabel
Nieves Ermecheo, Mónica
Quintana, José María
Menéndez, Rosario
Torres, Antoni
Zalacain Jorge, Rafael
Arostegui, Inmaculada
author_sort Hayet-Otero, Miren
collection PubMed
description With the COVID-19 pandemic having caused unprecedented numbers of infections and deaths, large research efforts have been undertaken to increase our understanding of the disease and the factors which determine diverse clinical evolutions. Here we focused on a fully data-driven exploration regarding which factors (clinical or otherwise) were most informative for SARS-CoV-2 pneumonia severity prediction via machine learning (ML). In particular, feature selection (FS) techniques, designed to reduce the dimensionality of data, allowed us to characterize which of our variables were the most useful for ML prognosis. We conducted a multi-centre clinical study, enrolling n = 1548 patients hospitalized due to SARS-CoV-2 pneumonia, of whom 792, 238, and 518 experienced low, medium and high-severity evolutions, respectively. Up to 106 patient-specific clinical variables were collected at admission, although 14 of them had to be discarded for containing ⩾60% missing values. Alongside 7 socioeconomic attributes and 32 exposures to air pollution (chronic and acute), these became d = 148 features after variable encoding. We addressed this ordinal classification problem both as an ML classification and as a regression task. Two imputation techniques for missing data were explored, along with a total of 166 unique FS algorithm configurations: 46 filters, 100 wrappers and 20 embedded methods. Of these, 21 setups achieved satisfactory bootstrap stability (⩾0.70) with reasonable computation times: 16 filters, 2 wrappers, and 3 embedded methods. The subsets of features selected by each technique showed modest Jaccard similarities to one another. However, they consistently pointed to the importance of certain explanatory variables, namely: the patient’s C-reactive protein (CRP), pneumonia severity index (PSI), respiratory rate (RR) and oxygen levels (saturation SpO2, quotients SpO2/RR and arterial SatO2/FiO2), the neutrophil-to-lymphocyte ratio (NLR) and, to a certain extent, neutrophil and lymphocyte counts separately, lactate dehydrogenase (LDH), and procalcitonin (PCT) levels in blood. A remarkable agreement was found a posteriori between our strategy and independent clinical research works investigating risk factors for COVID-19 severity. Hence, these findings stress the suitability of this type of fully data-driven approach for knowledge extraction, as a complement to clinical perspectives.
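The abstract refers to two quantities that are easy to illustrate: the Jaccard similarity between feature subsets, and the stability of a selector under bootstrap resampling. The following is a minimal sketch under stated assumptions, not the authors' pipeline. The function names, the toy correlation-based filter, and the synthetic data are all hypothetical, and the mean pairwise Jaccard similarity is used only as a simple stand-in for a stability score, since this record does not say which stability measure produced the ⩾0.70 threshold.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty subsets are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(subsets):
    """Average Jaccard similarity over all unordered pairs of subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def top_k_filter(rows, labels, k):
    """Toy filter (assumption): keep the k features with the largest
    absolute Pearson correlation with the label."""
    n, d = len(rows), len(rows[0])
    mean_y = sum(labels) / n
    var_y = sum((y - mean_y) ** 2 for y in labels)
    scores = []
    for j in range(d):
        col = [row[j] for row in rows]
        mean_x = sum(col) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(col, labels))
        var_x = sum((x - mean_x) ** 2 for x in col)
        denom = (var_x * var_y) ** 0.5
        scores.append(abs(cov / denom) if denom > 0 else 0.0)
    ranked = sorted(range(d), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def bootstrap_stability(rows, labels, k, n_boot=30, seed=0):
    """Rerun the selector on bootstrap resamples of the patients and
    report the mean pairwise Jaccard similarity of the selected subsets."""
    rng = random.Random(seed)
    n = len(rows)
    subsets = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        subsets.append(
            top_k_filter([rows[i] for i in idx], [labels[i] for i in idx], k)
        )
    return mean_pairwise_jaccard(subsets)


# Synthetic stand-in data: 3 ordinal severity levels, 2 informative
# features (columns 0 and 1) and 8 pure-noise features.
data_rng = random.Random(42)
labels = [data_rng.choice([0, 1, 2]) for _ in range(300)]
rows = [
    [y + data_rng.gauss(0, 0.5), -y + data_rng.gauss(0, 0.5)]
    + [data_rng.gauss(0, 1) for _ in range(8)]
    for y in labels
]

print("selected features:", sorted(top_k_filter(rows, labels, k=2)))
print("bootstrap stability:", bootstrap_stability(rows, labels, k=2))
```

A selector whose chosen subset barely changes across resamples scores close to 1 on this stand-in measure, while one that picks different features each time scores close to 0. The same `jaccard` helper can also compare the subsets returned by two different techniques, which is the comparison the abstract describes as yielding modest similarities.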
format Online
Article
Text
id pubmed-10101453
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10101453 2023-04-14 Extracting relevant predictive variables for COVID-19 severity prognosis: An exhaustive comparison of feature selection techniques Hayet-Otero, Miren García-García, Fernando Lee, Dae-Jin Martínez-Minaya, Joaquín España Yandiola, Pedro Pablo Urrutia Landa, Isabel Nieves Ermecheo, Mónica Quintana, José María Menéndez, Rosario Torres, Antoni Zalacain Jorge, Rafael Arostegui, Inmaculada PLoS One Research Article With the COVID-19 pandemic having caused unprecedented numbers of infections and deaths, large research efforts have been undertaken to increase our understanding of the disease and the factors which determine diverse clinical evolutions. Here we focused on a fully data-driven exploration regarding which factors (clinical or otherwise) were most informative for SARS-CoV-2 pneumonia severity prediction via machine learning (ML). In particular, feature selection (FS) techniques, designed to reduce the dimensionality of data, allowed us to characterize which of our variables were the most useful for ML prognosis. We conducted a multi-centre clinical study, enrolling n = 1548 patients hospitalized due to SARS-CoV-2 pneumonia, of whom 792, 238, and 518 experienced low, medium and high-severity evolutions, respectively. Up to 106 patient-specific clinical variables were collected at admission, although 14 of them had to be discarded for containing ⩾60% missing values. Alongside 7 socioeconomic attributes and 32 exposures to air pollution (chronic and acute), these became d = 148 features after variable encoding. We addressed this ordinal classification problem both as an ML classification and as a regression task. Two imputation techniques for missing data were explored, along with a total of 166 unique FS algorithm configurations: 46 filters, 100 wrappers and 20 embedded methods. Of these, 21 setups achieved satisfactory bootstrap stability (⩾0.70) with reasonable computation times: 16 filters, 2 wrappers, and 3 embedded methods. The subsets of features selected by each technique showed modest Jaccard similarities to one another. However, they consistently pointed to the importance of certain explanatory variables, namely: the patient’s C-reactive protein (CRP), pneumonia severity index (PSI), respiratory rate (RR) and oxygen levels (saturation SpO2, quotients SpO2/RR and arterial SatO2/FiO2), the neutrophil-to-lymphocyte ratio (NLR) and, to a certain extent, neutrophil and lymphocyte counts separately, lactate dehydrogenase (LDH), and procalcitonin (PCT) levels in blood. A remarkable agreement was found a posteriori between our strategy and independent clinical research works investigating risk factors for COVID-19 severity. Hence, these findings stress the suitability of this type of fully data-driven approach for knowledge extraction, as a complement to clinical perspectives. Public Library of Science 2023-04-13 /pmc/articles/PMC10101453/ /pubmed/37053151 http://dx.doi.org/10.1371/journal.pone.0284150 Text en © 2023 Hayet-Otero et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Extracting relevant predictive variables for COVID-19 severity prognosis: An exhaustive comparison of feature selection techniques
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10101453/
https://www.ncbi.nlm.nih.gov/pubmed/37053151
http://dx.doi.org/10.1371/journal.pone.0284150