Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models

In view of the growth of clinical risk prediction models using genetic data, there is an increasing need for studies that use appropriate methods to select the optimum number of features from a large number of genetic variants with a high degree of redundancy between features due to linkage disequil...


Bibliographic Details
Main authors: Jalali-najafabadi, Farideh, Stadler, Michael, Dand, Nick, Jadon, Deepak, Soomro, Mehreen, Ho, Pauline, Marzo-Ortega, Helen, Helliwell, Philip, Korendowych, Eleanor, Simpson, Michael A., Packham, Jonathan, Smith, Catherine H., Barker, Jonathan N., McHugh, Neil, Warren, Richard B., Barton, Anne, Bowes, John
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8640070/
https://www.ncbi.nlm.nih.gov/pubmed/34857774
http://dx.doi.org/10.1038/s41598-021-00854-x
author Jalali-najafabadi, Farideh
Stadler, Michael
Dand, Nick
Jadon, Deepak
Soomro, Mehreen
Ho, Pauline
Marzo-Ortega, Helen
Helliwell, Philip
Korendowych, Eleanor
Simpson, Michael A.
Packham, Jonathan
Smith, Catherine H.
Barker, Jonathan N.
McHugh, Neil
Warren, Richard B.
Barton, Anne
Bowes, John
collection PubMed
description In view of the growth of clinical risk prediction models using genetic data, there is an increasing need for studies that use appropriate methods to select the optimum number of features from a large number of genetic variants with a high degree of redundancy between features due to linkage disequilibrium (LD). Filter feature selection methods based on information theoretic criteria are well suited to this challenge and will identify a subset of the original variables that should result in more accurate prediction. However, data collected from cohort studies are often high-dimensional genetic data with potential confounders, which present challenges to feature selection and to machine learning risk prediction models. Patients with psoriasis are at high risk of developing a chronic arthritis known as psoriatic arthritis (PsA). The prevalence of PsA in this patient group can be up to 30%, and the identification of high-risk patients represents an important clinical research goal that would allow early intervention and a reduction of disability. This also provides us with an ideal scenario for the development of clinical risk prediction models and an opportunity to explore the application of information theoretic criteria methods. In this study, we developed feature selection and PsA risk prediction models and applied them to a cross-sectional genetic dataset of 1462 PsA cases and 1132 cutaneous-only psoriasis (PsC) cases, using 2-digit HLA alleles imputed with the SNP2HLA algorithm. We also developed a stratification method to mitigate the impact of potential confounder features and illustrate that confounding features impact the feature selection. The mitigated dataset was used to train seven supervised machine learning methods: a random 80% of the data was used for training with stratified nested cross validation, and the remaining 20% was held out for internal validation. The risk prediction models were then further validated in the UK Biobank dataset, containing data on 1187 participants and a set of features overlapping with the training dataset. Performance of these methods was evaluated using the area under the curve (AUC), accuracy, precision, recall, F1 score and decision curve analysis (net benefit). The best model was selected based on three criteria: the smallest feature subset, the maximal average AUC over the nested cross validation, and good generalisability to the UK Biobank dataset. In the original dataset, with over 100 different bootstraps and seven feature selection (FS) methods, HLA_C_*06 was selected as the most informative genetic variant. After mitigation, the single most important genetic feature by rank was identified as HLA_B_*27 by the seven different feature selection methods, consistent with previous analyses of these data using regression-based methods. However, the predictive accuracy of these single features post-mitigation was found to be moderate (AUC = 0.54 (internal cross validation), AUC = 0.53 (internal holdout set), AUC = 0.55 (external dataset)). Sequentially adding HLA features based on rank improved the performance of the Random Forest classification model, where 20 2-digit features selected by Interaction Capping (ICAP) achieved AUC = 0.61 (internal cross validation), AUC = 0.57 (internal holdout set) and AUC = 0.58 (external dataset). The stratification method for mitigation of confounding features and filter information theoretic feature selection can be applied to a high-dimensional dataset with potential confounders.
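As an illustration of two ideas named in the abstract, the sketch below ranks binary features by their empirical mutual information with the outcome (the simplest filter criterion, often called mutual information maximisation, and not the ICAP criterion reported above) and computes the net benefit used in decision curve analysis. It is a minimal sketch, not the authors' implementation: the function names, the feature names `allele_a`, `allele_b`, `allele_c` and all data values are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y), in bits, of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_features(columns, labels):
    """Return feature names sorted by mutual information with the label, highest first."""
    scores = {name: mutual_information(values, labels)
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit at risk threshold pt: TP/n - (FP/n) * pt / (1 - pt)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Made-up toy data (hypothetical, not taken from the study).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "allele_a": [1, 1, 1, 1, 0, 0, 0, 0],  # identical to the label
    "allele_b": [1, 1, 1, 0, 0, 0, 0, 1],  # partially informative
    "allele_c": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}
ranking = rank_features(columns, labels)

# Net benefit of hypothetical predicted risks at a 20% risk threshold.
nb = net_benefit([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1], 0.2)
```

A ranking of this kind scores each feature in isolation, so it does not account for redundancy between features in linkage disequilibrium; criteria such as ICAP add terms for that purpose.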
format Online
Article
Text
id pubmed-8640070
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-8640070 2021-12-06 Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models. Sci Rep. Article. Nature Publishing Group UK 2021-12-02 /pmc/articles/PMC8640070/ /pubmed/34857774 http://dx.doi.org/10.1038/s41598-021-00854-x Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models
topic Article