Detection of suspicious interactions of spiking covariates in methylation data

BACKGROUND: In methylation analyses such as epigenome-wide association studies, a large number of biomarkers are tested for an association between the measured continuous outcome and different covariates. In the case of a continuous covariate like smoking pack years (SPY), a measure of lifetime exposure...

Bibliographic Details
Main Authors: Sieg, Miriam, Richter, Gesa, Schaefer, Arne S., Kruppa, Jochen
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6993406/
https://www.ncbi.nlm.nih.gov/pubmed/32000657
http://dx.doi.org/10.1186/s12859-020-3364-6
author Sieg, Miriam
Richter, Gesa
Schaefer, Arne S.
Kruppa, Jochen
collection PubMed
description BACKGROUND: In methylation analyses such as epigenome-wide association studies, a large number of biomarkers are tested for an association between the measured continuous outcome and different covariates. In the case of a continuous covariate like smoking pack years (SPY), a measure of lifetime exposure to tobacco toxins, a spike at zero can occur. All non-smokers then form a peak at zero, while the smoking patients are distributed over the other SPY values. A spike can also occur on the right side of the covariate distribution if a category “heavy smoker” is defined. Here, we focus on methylation data with a spike at the left or the right of the distribution of a continuous covariate. After the methylation data are generated, the analysis usually consists of preprocessing, quality control, and determination of differentially methylated sites, often run as a pipeline. The data are thus processed by a chain of methods that are available in one software package. The pipelines can distinguish between categorical covariates, i.e. for group comparisons, and continuous covariates, i.e. for linear regression. The differential methylation analysis is often done internally by a linear regression without checking its inherent assumptions. A spike in the continuous covariate is ignored and can cause biased results. RESULTS: We reanalysed five data sets, four of them freely available from ArrayExpress, that include methylation data and smoking habits reported as smoking pack years. For this purpose, we developed an algorithm to check for the occurrence of suspicious interactions between the values associated with the spike position and the non-spike positions of the covariate. Our algorithm helps to decide whether a suspicious interaction is present and further investigations should be carried out. This is important mainly because the information on the differentially methylated sites will be used for post-hoc analyses such as pathway analyses. CONCLUSIONS: We help to check the validity of the linear regression assumptions in a methylation analysis pipeline. These assumptions should also be considered for machine learning approaches. In addition, we are able to detect outliers in the continuous covariate. Therefore, using our algorithm as a preprocessing step should produce statistically more robust results in methylation analysis.
format Online
Article
Text
id pubmed-6993406
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6993406 2020-02-04 Detection of suspicious interactions of spiking covariates in methylation data Sieg, Miriam Richter, Gesa Schaefer, Arne S. Kruppa, Jochen BMC Bioinformatics Methodology Article BioMed Central 2020-01-30 /pmc/articles/PMC6993406/ /pubmed/32000657 http://dx.doi.org/10.1186/s12859-020-3364-6 Text en © The Author(s) 2020 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Detection of suspicious interactions of spiking covariates in methylation data
topic Methodology Article