Influence of Feature Encoding and Choice of Classifier on Disease Risk Prediction in Genome-Wide Association Studies
Various attempts have been made to predict the individual disease risk based on genotype data from genome-wide association studies (GWAS). However, most studies only investigated one or two classification algorithms and feature encoding schemes. In this study, we applied seven different classificati...
Main authors: | Mittag, Florian; Römer, Michael; Zell, Andreas |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2015 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4540285/ https://www.ncbi.nlm.nih.gov/pubmed/26285210 http://dx.doi.org/10.1371/journal.pone.0135832 |
_version_ | 1782386225776164864 |
---|---|
author | Mittag, Florian; Römer, Michael; Zell, Andreas |
author_sort | Mittag, Florian |
collection | PubMed |
description | Various attempts have been made to predict the individual disease risk based on genotype data from genome-wide association studies (GWAS). However, most studies only investigated one or two classification algorithms and feature encoding schemes. In this study, we applied seven different classification algorithms on GWAS case-control data sets for seven different diseases to create models for disease risk prediction. Further, we used three different encoding schemes for the genotypes of single nucleotide polymorphisms (SNPs) and investigated their influence on the predictive performance of these models. Our study suggests that an additive encoding of the SNP data should be the preferred encoding scheme, as it proved to yield the best predictive performances for all algorithms and data sets. Furthermore, our results showed that the differences between most state-of-the-art classification algorithms are not statistically significant. Consequently, we recommend to prefer algorithms with simple models like the linear support vector machine (SVM) as they allow for better subsequent interpretation without significant loss of accuracy. |
format | Online Article Text |
id | pubmed-4540285 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-45402852015-08-24 Influence of Feature Encoding and Choice of Classifier on Disease Risk Prediction in Genome-Wide Association Studies Mittag, Florian Römer, Michael Zell, Andreas PLoS One Research Article Various attempts have been made to predict the individual disease risk based on genotype data from genome-wide association studies (GWAS). However, most studies only investigated one or two classification algorithms and feature encoding schemes. In this study, we applied seven different classification algorithms on GWAS case-control data sets for seven different diseases to create models for disease risk prediction. Further, we used three different encoding schemes for the genotypes of single nucleotide polymorphisms (SNPs) and investigated their influence on the predictive performance of these models. Our study suggests that an additive encoding of the SNP data should be the preferred encoding scheme, as it proved to yield the best predictive performances for all algorithms and data sets. Furthermore, our results showed that the differences between most state-of-the-art classification algorithms are not statistically significant. Consequently, we recommend to prefer algorithms with simple models like the linear support vector machine (SVM) as they allow for better subsequent interpretation without significant loss of accuracy. Public Library of Science 2015-08-18 /pmc/articles/PMC4540285/ /pubmed/26285210 http://dx.doi.org/10.1371/journal.pone.0135832 Text en © 2015 Mittag et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
title | Influence of Feature Encoding and Choice of Classifier on Disease Risk Prediction in Genome-Wide Association Studies |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4540285/ https://www.ncbi.nlm.nih.gov/pubmed/26285210 http://dx.doi.org/10.1371/journal.pone.0135832 |
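
The description above recommends an additive encoding of SNP genotypes. The record does not define the encoding, so the following is only a minimal sketch under the common convention that an additive encoding counts the copies of the minor allele (0, 1, or 2) in a diploid genotype. The function names, the two-letter genotype strings, and the example alleles are hypothetical and are not taken from the article.

```python
from typing import List


def additive_encode(genotype: str, minor_allele: str) -> int:
    """Return the number of minor-allele copies (0, 1, or 2) in a genotype.

    Assumption: a genotype is a two-letter string such as "AG".
    """
    if len(genotype) != 2:
        raise ValueError(f"expected a two-letter genotype, got {genotype!r}")
    return sum(1 for allele in genotype if allele == minor_allele)


def encode_sample(genotypes: List[str], minor_alleles: List[str]) -> List[int]:
    """Encode one individual's SNP genotypes as a numeric feature vector."""
    if len(genotypes) != len(minor_alleles):
        raise ValueError("need exactly one minor allele per SNP")
    return [additive_encode(g, m) for g, m in zip(genotypes, minor_alleles)]


# Hypothetical individual with three SNPs; "G" is assumed to be the minor allele.
features = encode_sample(["AA", "AG", "GG"], ["G", "G", "G"])
print(features)  # [0, 1, 2]
```

Each SNP becomes a single numeric feature, so a linear model such as the linear SVM named in the description would learn one weight per SNP, which is one reason such models are comparatively easy to interpret.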