
Data mining and machine learning approaches for the integration of genome-wide association and methylation data: methodology and main conclusions from GAW20

BACKGROUND: Multiple layers of genetic and epigenetic variability are being simultaneously explored in an increasing number of health studies. We summarize here different approaches applied in the Data Mining and Machine Learning group at the GAW20 to integrate genome-wide genotype and methylation a...


Bibliographic Details
Main Authors:	Darst, Burcu; Engelman, Corinne D.; Tian, Ye; Lorenzo Bermejo, Justo
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Methodology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6157271/ https://www.ncbi.nlm.nih.gov/pubmed/30255774 http://dx.doi.org/10.1186/s12863-018-0646-3

author	Darst, Burcu; Engelman, Corinne D.; Tian, Ye; Lorenzo Bermejo, Justo
collection	PubMed
description	BACKGROUND: Multiple layers of genetic and epigenetic variability are being simultaneously explored in an increasing number of health studies. We summarize here different approaches applied in the Data Mining and Machine Learning group at GAW20 to integrate genome-wide genotype and methylation array data. RESULTS: We provide a non-intimidating introduction to some frequently used methods to investigate high-dimensional molecular data and compare the different approaches tried by group members: random forest, deep learning, cluster analysis, mixed models, and gene-set enrichment analysis. Group contributions were quite heterogeneous regarding investigated data sets (real vs simulated), conducted data quality control, and assessed phenotypes (e.g., metabolic syndrome vs relative differences of log-transformed triglyceride concentrations before and after fenofibrate treatment). However, some common technical issues were detected, leading to practical recommendations. CONCLUSIONS: Different sources of correlation were identified by group members, including population stratification, family structure, batch effects, linkage disequilibrium, and correlation of methylation values at neighboring cytosine-phosphate-guanine (CpG) sites, and the majority of applied approaches were able to take into account identified correlation structures. The ability to efficiently deal with high-dimensional omics data and the model-free nature of the approaches, which did not require detailed model specifications, were clearly recognized as the main strengths of applied methods. A limitation of random forest is its sensitivity to highly correlated variables. The parameter setup and the interpretation of results from deep learning methods, in particular deep neural networks, can be extremely challenging. Cluster analysis and mixed models may need some prior dimension reduction based on existing literature, data filtering, and supplementary statistical methods, and gene-set enrichment analysis requires biological insight.
format	Online Article Text
id	pubmed-6157271
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6157271 2018-10-01 BMC Genet Methodology BioMed Central 2018-09-17 /pmc/articles/PMC6157271/ /pubmed/30255774 http://dx.doi.org/10.1186/s12863-018-0646-3 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Data mining and machine learning approaches for the integration of genome-wide association and methylation data: methodology and main conclusions from GAW20
topic	Methodology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6157271/ https://www.ncbi.nlm.nih.gov/pubmed/30255774 http://dx.doi.org/10.1186/s12863-018-0646-3
