2897. Development of a Machine Learning Modelling Tool for Predicting Incident HIV Using Public Health Data from a County in the Southern United States

Bibliographic Details
Main authors: Saldana, Carlos S, Burkhardt, Elizabeth, Pennisi, Alfred, Oliver, Kirsten, Olmstead, John, Holland, David P, Gettings, Jenna, Wortley, Pascale, Saldana Ochoa, Karla V
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10678962/
http://dx.doi.org/10.1093/ofid/ofad500.168
Description
Summary: BACKGROUND: Machine learning (ML) algorithms have predicted incident HIV using electronic medical record (EMR) data. We developed an ML model that uses de-identified public health data from a high-incidence area to predict incident HIV; such predictions could inform public health interventions such as HIV testing, education, and the scale-up of prevention strategies.

METHODS: We used de-identified public health data from Georgia’s State Electronic Notifiable Disease Surveillance System (SendSS) and Enhanced HIV/AIDS Reporting System (eHARs) from 01/2010 to 12/2021 in Fulton County, GA. Included variables are displayed in Table 1. We included males 13 years of age and older. Each individual's HIV status and HIV incidence during the study period were confirmed by matching individuals between the datasets. We excluded individuals diagnosed with HIV before 2010, those with an HIV diagnosis as their first sexually transmitted infection (STI) dataset entry, and those with more than 10% of variables missing. We matched a social vulnerability index (SVI) to each individual's census tract. To predict incident HIV, we trained various ML classification models on an equal number of HIV-positive and randomly selected HIV-negative observations, so that both the training set (85%) and the test set (15%) were balanced. [Figure: see text]

RESULTS: Of 85,224 individuals, a total of 1,698 male individuals (2%) were confirmed positive for HIV during the study period and met our inclusion criteria. The training set included 2,896 observations (1,448 HIV+ and 1,448 HIV-) and the test set included 500 observations (250 HIV+ and 250 HIV-). Among the ML models used, Gradient Boosted Trees and Random Forest achieved an accuracy as high as 80% for correctly predicting incident HIV in the test set. The most predictive features were mean age at STI diagnosis, STI diagnosing provider type, STI diagnosis interval, SVI Theme 1, and STI diagnosis. Model performance and evaluation are presented in Figure 1.
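
The class balancing and 85%/15% split described in METHODS can be illustrated with a minimal sketch. The abstract does not state which software or sampling routine the authors used, so the function name `balanced_split`, its parameters, and the toy identifiers below are hypothetical. The rounding used here will not necessarily reproduce the exact per-class counts reported in RESULTS.

```python
import random


def balanced_split(positives, negatives, train_frac=0.85, seed=0):
    """Undersample negatives to match the positives, then split per class.

    Hypothetical helper for illustration only; not the authors' code.
    Requires len(negatives) >= len(positives).
    """
    rng = random.Random(seed)
    pos = list(positives)
    # Randomly draw as many negatives as there are positives (undersampling).
    neg = rng.sample(list(negatives), k=len(pos))
    rng.shuffle(pos)
    # Same number of training observations for each class.
    n_train = round(train_frac * len(pos))
    train = [(x, 1) for x in pos[:n_train]] + [(x, 0) for x in neg[:n_train]]
    test = [(x, 1) for x in pos[n_train:]] + [(x, 0) for x in neg[n_train:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy usage: 20 "positive" IDs and 100 "negative" IDs (made-up values).
train, test = balanced_split(range(20), range(100, 200))
```

Each returned item is an `(observation, label)` pair, with label 1 for the positive class and 0 for the negative class, so both sets contain equal numbers of each label by construction.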
[Figure: see text] Random Forest (RF) and Gradient Boosted Trees (GBT) confusion matrices on a test set of 500 observations. Both the RF and GBT models achieved high accuracy, correctly predicting incident HIV in 202/250 (79%) and 204/250 (80%) individuals, respectively. Precision = number of true positives divided by the total number of positive predictions; Recall = percentage of the observations in a class that the model correctly identifies as belonging to that class; F1 score = combined score (harmonic mean) of precision and recall.

CONCLUSION: Our ML models can predict incident HIV with an accuracy of about 80% on a balanced test set and can be used to customize outreach activities. The approach is unique in that it relies strictly on de-identified STI reporting public health data, which makes it suitable for a broader population than EMR data. However, more research is needed to implement and evaluate these models in actual public health interventions.

DISCLOSURES: All Authors: No reported disclosures.
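
The metric definitions in the figure caption can be made concrete with a short sketch that computes them from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts in the usage line are illustrative assumptions; they are not the values from the study's Figure 1, which is not reproduced in this record.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts.

    tp/fp/fn/tn = true positives, false positives, false negatives,
    true negatives. Hypothetical helper for illustration only.
    """
    # Precision: true positives over all positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: true positives over all actual positives.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up counts for a balanced test set of 200 observations.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=90)
```

On a balanced test set such as the one described above, accuracy is the simple average of the recall for the positive class and the recall for the negative class.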