Influence of Feature Encoding and Choice of Classifier on Disease Risk Prediction in Genome-Wide Association Studies

Various attempts have been made to predict the individual disease risk based on genotype data from genome-wide association studies (GWAS). However, most studies only investigated one or two classification algorithms and feature encoding schemes. In this study, we applied seven different classificati...

Bibliographic Details
Main Authors: Mittag, Florian, Römer, Michael, Zell, Andreas
Format: Online Article Text
Language: English
Published: Public Library of Science 2015
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4540285/
https://www.ncbi.nlm.nih.gov/pubmed/26285210
http://dx.doi.org/10.1371/journal.pone.0135832
author Mittag, Florian
Römer, Michael
Zell, Andreas
collection PubMed
description Various attempts have been made to predict individual disease risk based on genotype data from genome-wide association studies (GWAS). However, most studies investigated only one or two classification algorithms and feature encoding schemes. In this study, we applied seven different classification algorithms to GWAS case-control data sets for seven different diseases to create models for disease risk prediction. Further, we used three different encoding schemes for the genotypes of single nucleotide polymorphisms (SNPs) and investigated their influence on the predictive performance of these models. Our study suggests that an additive encoding of the SNP data should be the preferred encoding scheme, as it yielded the best predictive performance for all algorithms and data sets. Furthermore, our results showed that the differences between most state-of-the-art classification algorithms are not statistically significant. Consequently, we recommend preferring algorithms with simple models, such as the linear support vector machine (SVM), as they allow for better subsequent interpretation without significant loss of accuracy.
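The description above names an additive SNP encoding but does not define it. The sketch below assumes the common definition, in which each genotype is replaced by the number of minor-allele copies it carries (0, 1, or 2). The function names, the two-letter genotype strings, and the sample data are hypothetical illustrations, not taken from the article.

```python
from typing import List, Sequence


def additive_encode(genotype: str, minor_allele: str) -> int:
    """Count copies of the minor allele in a two-allele genotype.

    Assumption: genotypes are two-letter strings such as "AG".
    Returns 0, 1, or 2.
    """
    if len(genotype) != 2:
        raise ValueError(f"expected a two-allele genotype, got {genotype!r}")
    return sum(1 for allele in genotype if allele == minor_allele)


def encode_sample(genotypes: Sequence[str],
                  minor_alleles: Sequence[str]) -> List[int]:
    """Encode one individual's SNP genotypes as an additive feature vector."""
    if len(genotypes) != len(minor_alleles):
        raise ValueError("need exactly one minor allele per SNP")
    return [additive_encode(g, m) for g, m in zip(genotypes, minor_alleles)]


# Hypothetical individual with three SNPs, each with minor allele "G".
sample_genotypes = ["AA", "AG", "GG"]
sample_minor_alleles = ["G", "G", "G"]
features = encode_sample(sample_genotypes, sample_minor_alleles)
print(features)  # [0, 1, 2]
```

Under this assumed definition each SNP contributes a single numeric feature, so a linear classifier learns one weight per SNP, which may be part of why simple linear models stay easy to interpret.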
format Online
Article
Text
id pubmed-4540285
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4540285 2015-08-24 Influence of Feature Encoding and Choice of Classifier on Disease Risk Prediction in Genome-Wide Association Studies Mittag, Florian Römer, Michael Zell, Andreas PLoS One Research Article Public Library of Science 2015-08-18 /pmc/articles/PMC4540285/ /pubmed/26285210 http://dx.doi.org/10.1371/journal.pone.0135832 Text en © 2015 Mittag et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Influence of Feature Encoding and Choice of Classifier on Disease Risk Prediction in Genome-Wide Association Studies
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4540285/
https://www.ncbi.nlm.nih.gov/pubmed/26285210
http://dx.doi.org/10.1371/journal.pone.0135832