
Consequences of ignoring clustering in linear regression

BACKGROUND: Clustering of observations is a common phenomenon in epidemiological and clinical research. Previous studies have highlighted the importance of using multilevel analysis to account for such clustering, but in practice, methods ignoring clustering are often employed. We used simulated dat...


Bibliographic Details
Main Authors: Ntani, Georgia, Inskip, Hazel, Osmond, Clive, Coggon, David
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8265092/
https://www.ncbi.nlm.nih.gov/pubmed/34233609
http://dx.doi.org/10.1186/s12874-021-01333-7
author Ntani, Georgia
Inskip, Hazel
Osmond, Clive
Coggon, David
collection PubMed
description BACKGROUND: Clustering of observations is a common phenomenon in epidemiological and clinical research. Previous studies have highlighted the importance of using multilevel analysis to account for such clustering, but in practice, methods ignoring clustering are often employed. We used simulated data to explore the circumstances in which failure to account for clustering in linear regression could lead to importantly erroneous conclusions.
METHODS: We simulated data following the random-intercept model specification under different scenarios of clustering of a continuous outcome and a single continuous or binary explanatory variable. We fitted random-intercept (RI) and ordinary least squares (OLS) models and compared effect estimates with the “true” value that had been used in simulation. We also assessed the relative precision of effect estimates, and explored the extent to which coverage by 95% confidence intervals and Type I error rates were appropriate.
RESULTS: We found that effect estimates from both types of regression model were on average unbiased. However, deviations from the “true” value were greater when the outcome variable was more clustered. For a continuous explanatory variable, they tended also to be greater for the OLS than the RI model, and when the explanatory variable was less clustered. The precision of effect estimates from the OLS model was overestimated when the explanatory variable varied more between than within clusters, and was somewhat underestimated when the explanatory variable was less clustered. The cluster-unadjusted model gave poor coverage rates by 95% confidence intervals and high Type I error rates when the explanatory variable was continuous. With a binary explanatory variable, coverage rates by 95% confidence intervals and Type I error rates deviated from nominal values when the outcome variable was more clustered, but the direction of the deviation varied according to the overall prevalence of the explanatory variable, and the extent to which it was clustered.
CONCLUSIONS: In this study we identified circumstances in which application of an OLS regression model to clustered data is more likely to mislead statistical inference. The potential for error is greatest when the explanatory variable is continuous, and the outcome variable more clustered (intraclass correlation coefficient is ≥ 0.01).
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12874-021-01333-7.
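The simulation design summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the parameterization (unit total variance, split by the hypothetical arguments `icc_y` and `icc_x`), the cluster counts, and the function names are all assumptions made for illustration. It draws data from a random-intercept model, fits a cluster-unadjusted OLS slope, and reports how often the naive 95% confidence interval covers the "true" slope.

```python
import random
import statistics


def simulate_clustered(n_clusters, cluster_size, beta, icc_y, icc_x, rng):
    """Draw (x, y) pairs from an assumed random-intercept model.

    y_ij = beta * x_ij + u_j + e_ij, with Var(u_j) = icc_y, Var(e_ij) = 1 - icc_y
    x_ij = a_j + w_ij,               with Var(a_j) = icc_x, Var(w_ij) = 1 - icc_x
    """
    xs, ys = [], []
    for _ in range(n_clusters):
        u = rng.gauss(0.0, icc_y ** 0.5)  # cluster-level shift in the outcome
        a = rng.gauss(0.0, icc_x ** 0.5)  # cluster-level shift in the exposure
        for _ in range(cluster_size):
            x = a + rng.gauss(0.0, (1.0 - icc_x) ** 0.5)
            y = beta * x + u + rng.gauss(0.0, (1.0 - icc_y) ** 0.5)
            xs.append(x)
            ys.append(y)
    return xs, ys


def ols_slope_and_se(xs, ys):
    """Simple-regression OLS slope with its usual (cluster-unadjusted) SE."""
    n = len(xs)
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se


def coverage(n_sims, beta=0.5, icc_y=0.3, icc_x=0.8, seed=1):
    """Share of simulations whose naive 95% interval contains beta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        xs, ys = simulate_clustered(20, 20, beta, icc_y, icc_x, rng)
        slope, se = ols_slope_and_se(xs, ys)
        if abs(slope - beta) <= 1.96 * se:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    print("clustered outcome:  ", coverage(300, icc_y=0.3))
    print("unclustered outcome:", coverage(300, icc_y=0.0))
```

Under these assumed settings, where the explanatory variable varies mostly between clusters, the naive interval should cover the true slope noticeably less often than 95% when the outcome is clustered, and close to 95% when it is not. A random-intercept fit is omitted here to keep the sketch within the standard library; a mixed-model routine from a statistics package would be needed for that comparison.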
format Online
Article
Text
id pubmed-8265092
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8265092 2021-07-08 Consequences of ignoring clustering in linear regression Ntani, Georgia Inskip, Hazel Osmond, Clive Coggon, David BMC Med Res Methodol Research Article BioMed Central 2021-07-07 /pmc/articles/PMC8265092/ /pubmed/34233609 http://dx.doi.org/10.1186/s12874-021-01333-7 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Consequences of ignoring clustering in linear regression
topic Research Article