Careful feature selection is key in classification of Alzheimer’s disease patients based on whole-genome sequencing data
Despite the great increase in the amount of data from genome-wide association studies (GWAS) and whole-genome sequencing (WGS), the genetic background of Alzheimer’s disease (AD), which is partially heritable, is not yet fully understood. Machine learning methods are expected to help researchers in the analysis...
Main authors: | Osipowicz, Marlena; Wilczynski, Bartek; Machnicka, Magdalena A |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2021 |
Subjects: | Standard Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8315124/ https://www.ncbi.nlm.nih.gov/pubmed/34327330 http://dx.doi.org/10.1093/nargab/lqab069 |
_version_ | 1783729672201175040 |
---|---|
author | Osipowicz, Marlena Wilczynski, Bartek Machnicka, Magdalena A |
collection | PubMed |
description | Despite the great increase in the amount of data from genome-wide association studies (GWAS) and whole-genome sequencing (WGS), the genetic background of Alzheimer’s disease (AD), which is partially heritable, is not yet fully understood. Machine learning methods are expected to help researchers analyze the large number of SNPs possibly associated with disease onset. To date, a number of such approaches have been applied to genotype-based classification of AD patients and healthy controls using GWAS data, with reported accuracies of 0.65–0.975. However, since the estimated influence of genotype on sporadic AD occurrence is lower than that, these very high classification accuracies may be a result of overfitting. We explored feature selection and classification using random forests on WGS and GWAS data from two datasets. Our results suggest that this approach is prone to overfitting if feature selection is performed before the data are divided into training and testing sets. Therefore, we recommend that the features used to build the model not be selected based on data included in the testing set. We suggest that for currently available dataset sizes the expected classifier performance is between 0.55 and 0.7 (AUC), and that higher accuracies reported in the literature are likely a result of overfitting. |
format | Online Article Text |
id | pubmed-8315124 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-8315124 2021-07-28. Careful feature selection is key in classification of Alzheimer’s disease patients based on whole-genome sequencing data. Osipowicz, Marlena; Wilczynski, Bartek; Machnicka, Magdalena A. NAR Genom Bioinform. Standard Article. Oxford University Press 2021-07-27. /pmc/articles/PMC8315124/ /pubmed/34327330 http://dx.doi.org/10.1093/nargab/lqab069 Text en. © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
title | Careful feature selection is key in classification of Alzheimer’s disease patients based on whole-genome sequencing data |
topic | Standard Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8315124/ https://www.ncbi.nlm.nih.gov/pubmed/34327330 http://dx.doi.org/10.1093/nargab/lqab069 |
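
The description above warns that selecting features before dividing the data into training and testing sets leads to overfitting. The following is a minimal sketch of that effect on purely synthetic data, not the authors' pipeline. All names here (`make_noise_data`, `select_features`, `centroid_accuracy`, `run_experiment`, `mean_accuracies`) and all parameter values are hypothetical. To keep the example dependency-free, a simple mean-difference filter and a nearest-centroid classifier stand in for the feature selection and random forests used in the article. The genotypes and labels are random, so an honest classifier should score near chance.

```python
import random


def make_noise_data(n_samples, n_features, rng):
    """Random genotypes (0/1/2) and balanced random labels: no true signal."""
    X = [[rng.randrange(3) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    rng.shuffle(y)
    return X, y


def select_features(X, y, rows, k):
    """Pick the k features with the largest class-mean gap, using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def centroid_accuracy(X, y, train, test, feats):
    """Fit a nearest-centroid classifier on `train`; return accuracy on `test`."""
    def centroid(label):
        rows = [i for i in train if y[i] == label]
        return [sum(X[i][j] for i in rows) / len(rows) for j in feats]

    c0, c1 = centroid(0), centroid(1)
    correct = 0
    for i in test:
        d0 = sum((X[i][j] - c) ** 2 for j, c in zip(feats, c0))
        d1 = sum((X[i][j] - c) ** 2 for j, c in zip(feats, c1))
        pred = 1 if d1 < d0 else 0
        correct += pred == y[i]
    return correct / len(test)


def run_experiment(seed, n_samples=80, n_features=500, k=20):
    """Return (leaky, clean) test accuracy for one random dataset and split."""
    rng = random.Random(seed)
    X, y = make_noise_data(n_samples, n_features, rng)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    train, test = idx[: n_samples // 2], idx[n_samples // 2:]
    leaky_feats = select_features(X, y, idx, k)    # sees test rows: leakage
    clean_feats = select_features(X, y, train, k)  # training rows only
    return (
        centroid_accuracy(X, y, train, test, leaky_feats),
        centroid_accuracy(X, y, train, test, clean_feats),
    )


def mean_accuracies(n_repeats=10):
    """Average both accuracies over several seeds to smooth out split noise."""
    results = [run_experiment(seed) for seed in range(n_repeats)]
    leaky = sum(r[0] for r in results) / n_repeats
    clean = sum(r[1] for r in results) / n_repeats
    return leaky, clean


if __name__ == "__main__":
    leaky, clean = mean_accuracies()
    print(f"features selected on all data:      {leaky:.2f}")
    print(f"features selected on training only: {clean:.2f}")
```

Because the labels carry no information about the genotypes, any test accuracy clearly above 0.5 in the first setting comes from the selection step having seen the test samples. In a real analysis, the same precaution means repeating feature selection inside each training fold instead of running it once on the full dataset.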