A comparison of feature selection and classification methods in DNA methylation studies using the Illumina Infinium platform

BACKGROUND: The 27k Illumina Infinium Methylation Beadchip is a popular high-throughput technology that allows the methylation state of over 27,000 CpGs to be assayed. While feature selection and classification methods have been comprehensively explored in the context of gene expression data, relati...

Bibliographic Details
Main authors: Zhuang, Joanna; Widschwendter, Martin; Teschendorff, Andrew E
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3364843/
https://www.ncbi.nlm.nih.gov/pubmed/22524302
http://dx.doi.org/10.1186/1471-2105-13-59
collection PubMed
description BACKGROUND: The 27k Illumina Infinium Methylation Beadchip is a popular high-throughput technology that allows the methylation state of over 27,000 CpGs to be assayed. While feature selection and classification methods have been comprehensively explored in the context of gene expression data, relatively little is known as to how best to perform feature selection or classification in the context of Illumina Infinium methylation data. Given the rising importance of epigenomics in cancer and other complex genetic diseases, and in view of the upcoming epigenome wide association studies, it is critical to identify the statistical methods that offer improved inference in this novel context. RESULTS: Using a total of 7 large Illumina Infinium 27k Methylation data sets, encompassing over 1,000 samples from a wide range of tissues, we here provide an evaluation of popular feature selection, dimensional reduction and classification methods on DNA methylation data. Specifically, we evaluate the effects of variance filtering, supervised principal components (SPCA) and the choice of DNA methylation quantification measure on downstream statistical inference. We show that for relatively large sample sizes feature selection using test statistics is similar for M and β-values, but that in the limit of small sample sizes, M-values allow more reliable identification of true positives. We also show that the effect of variance filtering on feature selection is study-specific and dependent on the phenotype of interest and tissue type profiled. Specifically, we find that variance filtering improves the detection of true positives in studies with large effect sizes, but that it may lead to worse performance in studies with smaller yet significant effect sizes. In contrast, supervised principal components improves the statistical power, especially in studies with small effect sizes. 
We also demonstrate that classification using the Elastic Net and Support Vector Machine (SVM) clearly outperforms competing methods like LASSO and SPCA. Finally, in unsupervised modelling of cancer diagnosis, we find that non-negative matrix factorisation (NMF) clearly outperforms principal components analysis. CONCLUSIONS: Our results highlight the importance of tailoring the feature selection and classification methodology to the sample size and biological context of the DNA methylation study. The Elastic Net emerges as a powerful classification algorithm for large-scale DNA methylation studies, while NMF does well in the unsupervised context. The insights presented here will be useful to any study embarking on large-scale DNA methylation profiling using Illumina Infinium beadarrays.
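The abstract compares β-values and M-values and discusses variance filtering without defining them. As a minimal sketch, assuming the commonly used logit relation M = log2(β / (1 − β)), which this record does not state, the two quantification measures and a simple variance filter could look as follows. The function names, the `eps` clamp, and the probe identifiers are hypothetical and are not taken from the article.

```python
import math
import statistics


def beta_to_m(beta, eps=1e-6):
    """Convert a beta-value in [0, 1] to an M-value (base-2 logit).

    The eps clamp is an assumption of this sketch; it keeps the
    logarithm finite when beta is exactly 0 or 1.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: map an M-value back to a beta-value."""
    return 2.0 ** m / (2.0 ** m + 1.0)


def top_variance_probes(profiles, k):
    """Return the names of the k probes with the largest sample variance.

    `profiles` maps a probe name to its list of values across samples.
    """
    ranked = sorted(
        profiles,
        key=lambda name: statistics.variance(profiles[name]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical probes and values, for illustration only.
    profiles = {
        "probe_a": [0.1, 0.9, 0.5],
        "probe_b": [0.5, 0.5, 0.5],
        "probe_c": [0.4, 0.6, 0.5],
    }
    print(top_variance_probes(profiles, 2))
    print([round(beta_to_m(b), 3) for b in profiles["probe_a"]])
```

The filter here ranks probes by variance without using the phenotype, which is why, as the abstract notes, its benefit can depend on the effect size in a given study.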
id pubmed-3364843
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-3364843 2012-06-05 BMC Bioinformatics Research Article BioMed Central 2012-04-24 /pmc/articles/PMC3364843/ /pubmed/22524302 http://dx.doi.org/10.1186/1471-2105-13-59 Text en Copyright ©2012 Zhuang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A comparison of feature selection and classification methods in DNA methylation studies using the Illumina Infinium platform
topic Research Article