
Careful feature selection is key in classification of Alzheimer’s disease patients based on whole-genome sequencing data

Despite the great increase in the amount of data from genome-wide association studies (GWAS) and whole-genome sequencing (WGS), the genetic background of the partially heritable Alzheimer’s disease (AD) is not yet fully understood. Machine learning methods are expected to help researchers in the analysis...


Bibliographic Details
Main Authors: Osipowicz, Marlena; Wilczynski, Bartek; Machnicka, Magdalena A
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8315124/
https://www.ncbi.nlm.nih.gov/pubmed/34327330
http://dx.doi.org/10.1093/nargab/lqab069
author Osipowicz, Marlena
Wilczynski, Bartek
Machnicka, Magdalena A
collection PubMed
description Despite the great increase in the amount of data from genome-wide association studies (GWAS) and whole-genome sequencing (WGS), the genetic background of the partially heritable Alzheimer’s disease (AD) is not yet fully understood. Machine learning methods are expected to help researchers analyze the large number of SNPs possibly associated with disease onset. To date, a number of such approaches have been applied to genotype-based classification of AD patients and healthy controls using GWAS data, with reported accuracies of 0.65–0.975. However, since the estimated influence of genotype on sporadic AD occurrence is lower than that, these very high classification accuracies may be a result of overfitting. We explored feature selection and classification using random forests on WGS and GWAS data from two datasets. Our results suggest that this approach is prone to overfitting if feature selection is performed before the data are divided into training and testing sets. Therefore, we recommend that features used to build the model not be selected based on data included in the testing set. We suggest that for currently available dataset sizes the expected classifier performance is between 0.55 and 0.7 (AUC) and that higher accuracies reported in the literature are likely a result of overfitting.
format Online
Article
Text
id pubmed-8315124
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8315124 2021-07-28 NAR Genom Bioinform Standard Article Oxford University Press 2021-07-27 /pmc/articles/PMC8315124/ /pubmed/34327330 http://dx.doi.org/10.1093/nargab/lqab069 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Careful feature selection is key in classification of Alzheimer’s disease patients based on whole-genome sequencing data
topic Standard Article
work_keys_str_mv AT osipowiczmarlena carefulfeatureselectioniskeyinclassificationofalzheimersdiseasepatientsbasedonwholegenomesequencingdata
AT wilczynskibartek carefulfeatureselectioniskeyinclassificationofalzheimersdiseasepatientsbasedonwholegenomesequencingdata
AT machnickamagdalenaa carefulfeatureselectioniskeyinclassificationofalzheimersdiseasepatientsbasedonwholegenomesequencingdata
AT carefulfeatureselectioniskeyinclassificationofalzheimersdiseasepatientsbasedonwholegenomesequencingdata
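
The overfitting risk named in the description (selecting features before the data are divided into training and testing sets) can be illustrated with a small sketch. This is not the authors' pipeline. It rests on assumptions made only for illustration: the data are pure Gaussian noise rather than genotypes, a nearest-centroid classifier stands in for random forests, plain accuracy stands in for AUC, and every function name and parameter value below is hypothetical. Because no feature carries any signal, an honest estimate should sit near chance, and anything well above that comes from the test samples leaking into feature selection.

```python
import random


def make_noise_data(n_samples, n_features, rng):
    """Pure-noise features: by construction nothing predicts the label."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # alternating, balanced labels
    return X, y


def select_top_features(X, y, rows, k):
    """Rank features by |difference of class means|, using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def nearest_centroid_predict(X, y, train, test, feats):
    """Fit class centroids on `train` rows, then label each `test` row."""
    def centroid(label):
        rows = [i for i in train if y[i] == label]
        return [sum(X[i][j] for i in rows) / len(rows) for j in feats]

    c0, c1 = centroid(0), centroid(1)
    preds = []
    for i in test:
        d0 = sum((X[i][j] - a) ** 2 for j, a in zip(feats, c0))
        d1 = sum((X[i][j] - b) ** 2 for j, b in zip(feats, c1))
        preds.append(1 if d1 < d0 else 0)
    return preds


def cv_accuracy(X, y, k_features, n_folds, leaky):
    """Cross-validated accuracy. With leaky=True the features are chosen
    using all samples, test folds included; with leaky=False they are
    chosen from the training rows of each fold only."""
    all_rows = list(range(len(X)))
    correct = 0
    for fold in range(n_folds):
        # An odd fold count keeps both classes present in every fold,
        # because the labels alternate with the row index.
        test = [i for i in all_rows if i % n_folds == fold]
        train = [i for i in all_rows if i % n_folds != fold]
        selection_rows = all_rows if leaky else train
        feats = select_top_features(X, y, selection_rows, k_features)
        preds = nearest_centroid_predict(X, y, train, test, feats)
        correct += sum(p == y[i] for p, i in zip(preds, test))
    return correct / len(X)


rng = random.Random(0)
X, y = make_noise_data(n_samples=80, n_features=2000, rng=rng)
leaky_acc = cv_accuracy(X, y, k_features=20, n_folds=5, leaky=True)
clean_acc = cv_accuracy(X, y, k_features=20, n_folds=5, leaky=False)
print(f"selection on all samples (leaky): {leaky_acc:.2f}")
print(f"selection inside training folds:  {clean_acc:.2f}")
```

The two runs differ only in which rows `select_top_features` is allowed to see. Keeping the selection step inside each training fold is what makes the second estimate an honest one.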