Epidemiological cluster identification using multiple data sources: an approach using logistic regression

In the management of infectious disease outbreaks, grouping cases into clusters and understanding their underlying epidemiology are fundamental tasks. In genomic epidemiology, clusters are typically identified either using pathogen sequences alone or with sequences in combination with epidemiologica...

Bibliographic Details
Main Authors:	Susvitasari, Kurnia, Tupper, Paul F., Cancino-Muños, Irving, Lòpez, Mariana G., Comas, Iñaki, Colijn, Caroline
Format:	Online Article Text
Language:	English
Published:	Microbiology Society 2023
Subjects:	Research Articles
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10132077/ https://www.ncbi.nlm.nih.gov/pubmed/36867086 http://dx.doi.org/10.1099/mgen.0.000929

_version_	1785031322356416512
author	Susvitasari, Kurnia Tupper, Paul F. Cancino-Muños, Irving Lòpez, Mariana G. Comas, Iñaki Colijn, Caroline
author_sort	Susvitasari, Kurnia
collection	PubMed
description	In the management of infectious disease outbreaks, grouping cases into clusters and understanding their underlying epidemiology are fundamental tasks. In genomic epidemiology, clusters are typically identified either using pathogen sequences alone or with sequences in combination with epidemiological data such as location and time of collection. However, it may not be feasible to culture and sequence all pathogen isolates, so sequence data may not be available for all cases. This presents challenges for identifying clusters and understanding epidemiology, because these cases may be important for transmission. Demographic, clinical and location data are likely to be available for unsequenced cases, and comprise partial information about their clustering. Here, we use statistical modelling to assign unsequenced cases to clusters already identified by genomic methods, assuming that a more direct method of linking individuals, such as contact tracing, is not available. We build our model on pairwise similarity between cases to predict whether cases cluster together, in contrast to using individual case data to predict the cases’ clusters. We then develop methods that allow us to determine whether a pair of unsequenced cases are likely to cluster together, to group them into their most probable clusters, to identify which are most likely to be members of a specific (known) cluster, and to estimate the true size of a known cluster given a set of unsequenced cases. We apply our method to tuberculosis data from Valencia, Spain. Among other applications, we find that clustering can be predicted successfully using spatial distance between cases and whether nationality is the same. We can identify the correct cluster for an unsequenced case, among 38 possible clusters, with an accuracy of approximately 35 %, higher than both direct multinomial regression (17 %) and random selection (< 5 %).
format	Online Article Text
id	pubmed-10132077
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Microbiology Society
record_format	MEDLINE/PubMed
spelling	pubmed-101320772023-04-27 Epidemiological cluster identification using multiple data sources: an approach using logistic regression Susvitasari, Kurnia Tupper, Paul F. Cancino-Muños, Irving Lòpez, Mariana G. Comas, Iñaki Colijn, Caroline Microb Genom Research Articles In the management of infectious disease outbreaks, grouping cases into clusters and understanding their underlying epidemiology are fundamental tasks. In genomic epidemiology, clusters are typically identified either using pathogen sequences alone or with sequences in combination with epidemiological data such as location and time of collection. However, it may not be feasible to culture and sequence all pathogen isolates, so sequence data may not be available for all cases. This presents challenges for identifying clusters and understanding epidemiology, because these cases may be important for transmission. Demographic, clinical and location data are likely to be available for unsequenced cases, and comprise partial information about their clustering. Here, we use statistical modelling to assign unsequenced cases to clusters already identified by genomic methods, assuming that a more direct method of linking individuals, such as contact tracing, is not available. We build our model on pairwise similarity between cases to predict whether cases cluster together, in contrast to using individual case data to predict the cases’ clusters. We then develop methods that allow us to determine whether a pair of unsequenced cases are likely to cluster together, to group them into their most probable clusters, to identify which are most likely to be members of a specific (known) cluster, and to estimate the true size of a known cluster given a set of unsequenced cases. We apply our method to tuberculosis data from Valencia, Spain. Among other applications, we find that clustering can be predicted successfully using spatial distance between cases and whether nationality is the same. 
We can identify the correct cluster for an unsequenced case, among 38 possible clusters, with an accuracy of approximately 35 %, higher than both direct multinomial regression (17 %) and random selection (< 5 %). Microbiology Society 2023-03-03 /pmc/articles/PMC10132077/ /pubmed/36867086 http://dx.doi.org/10.1099/mgen.0.000929 Text en © 2023 The Authors https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.
spellingShingle	Research Articles Susvitasari, Kurnia Tupper, Paul F. Cancino-Muños, Irving Lòpez, Mariana G. Comas, Iñaki Colijn, Caroline Epidemiological cluster identification using multiple data sources: an approach using logistic regression
title	Epidemiological cluster identification using multiple data sources: an approach using logistic regression
topic	Research Articles
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10132077/ https://www.ncbi.nlm.nih.gov/pubmed/36867086 http://dx.doi.org/10.1099/mgen.0.000929
work_keys_str_mv	AT susvitasarikurnia epidemiologicalclusteridentificationusingmultipledatasourcesanapproachusinglogisticregression AT tupperpaulf epidemiologicalclusteridentificationusingmultipledatasourcesanapproachusinglogisticregression AT cancinomunosirving epidemiologicalclusteridentificationusingmultipledatasourcesanapproachusinglogisticregression AT lopezmarianag epidemiologicalclusteridentificationusingmultipledatasourcesanapproachusinglogisticregression AT comasinaki epidemiologicalclusteridentificationusingmultipledatasourcesanapproachusinglogisticregression AT colijncaroline epidemiologicalclusteridentificationusingmultipledatasourcesanapproachusinglogisticregression
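
The description above outlines a pairwise approach: a logistic regression is fitted on similarity between pairs of cases (spatial distance, whether nationality is the same) to predict whether the pair shares a cluster. The following is a minimal Python sketch of that idea, not the authors' implementation. The toy data, the function names, the plain gradient-descent fit and the "highest mean pairwise probability" assignment rule are all assumptions made for illustration; the record does not specify them.

```python
import math
from itertools import combinations

# Toy sequenced cases: (x_km, y_km, nationality, cluster_id).
# These values are invented for illustration; they are not the Valencia data.
CASES = [
    (0.0, 0.0, "A", 0), (0.5, 0.2, "A", 0), (0.3, 0.7, "A", 0),
    (5.0, 5.0, "B", 1), (5.4, 4.8, "B", 1), (4.7, 5.3, "B", 1),
]

def pair_features(a, b):
    """Pairwise features: intercept, spatial distance, same-nationality flag."""
    dist = math.hypot(a[0] - b[0], a[1] - b[1])
    return [1.0, dist, 1.0 if a[2] == b[2] else 0.0]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Predicted probability that a pair with features x shares a cluster."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit logistic regression weights by batch gradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

# Training set: one row per pair of sequenced cases,
# label 1 if the two cases are in the same genomic cluster.
X = [pair_features(a, b) for a, b in combinations(CASES, 2)]
Y = [1.0 if a[3] == b[3] else 0.0 for a, b in combinations(CASES, 2)]
W = fit_logistic(X, Y)

def assign_cluster(new_case, cases, w):
    """Assumed rule: pick the cluster with the highest mean pairwise probability."""
    probs = {}
    for c in cases:
        probs.setdefault(c[3], []).append(predict(w, pair_features(new_case, c)))
    means = {k: sum(v) / len(v) for k, v in probs.items()}
    return max(means, key=means.get), means

# An unsequenced case has location and nationality but no cluster label.
best, scores = assign_cluster((0.2, 0.3, "A"), CASES, W)
print(best, scores)
```

Modelling pairs rather than individual cases means the classifier learns what makes two cases alike, so the same fitted weights can score a new case against every known cluster without one model parameter set per cluster.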

Similar Items