2897. Development of a Machine Learning Modelling Tool for Predicting Incident HIV Using Public Health Data from a County in the Southern United States
BACKGROUND: Machine Learning (ML) algorithms have predicted incident HIV using electronic medical record (EMR) data. We developed an ML model using de-identified public health data from a high-incidence area to predict incident HIV which could inform public health interventions such as HIV testing,...
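Main authors: | Saldana, Carlos S; Burkhardt, Elizabeth; Pennisi, Alfred; Oliver, Kirsten; Olmstead, John; Holland, David P; Gettings, Jenna; Wortley, Pascale; Saldana Ochoa, Karla V |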
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10678962/ http://dx.doi.org/10.1093/ofid/ofad500.168 |
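
The METHODS section of the abstract (given in full in this record's description field) describes training on an equal number of HIV-positive and randomly selected HIV-negative observations, divided into training and test sets. A minimal sketch of that kind of balanced undersampling might look like the following. It is not the authors' code: the function name `balanced_split`, the synthetic integer IDs, the fixed seed, and the treatment of all remaining individuals as the negative pool are assumptions for illustration. Only the counts 85,224, 1,698, and 250 per test class come from the abstract.

```python
import random


def balanced_split(positives, negatives, n_test_per_class, seed=0):
    """Undersample negatives to match positives, then split per class.

    Hypothetical helper for illustration; not the study's actual pipeline.
    Returns (train, test) as lists of (record_id, label) pairs.
    """
    rng = random.Random(seed)
    pos = list(positives)
    rng.shuffle(pos)
    # Draw as many negatives as there are positives, without replacement.
    neg = rng.sample(list(negatives), len(pos))

    # Hold out the same number of records from each class for testing.
    test = [(r, 1) for r in pos[:n_test_per_class]]
    test += [(r, 0) for r in neg[:n_test_per_class]]
    train = [(r, 1) for r in pos[n_test_per_class:]]
    train += [(r, 0) for r in neg[n_test_per_class:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Synthetic IDs standing in for de-identified records (assumption).
positive_ids = range(1698)         # 1,698 HIV-positive individuals
negative_ids = range(1698, 85224)  # remaining individuals, assumed negative
train_set, test_set = balanced_split(positive_ids, negative_ids, n_test_per_class=250)
print(len(train_set), len(test_set))
```

With these inputs the sketch yields 2,896 training and 500 test observations, the set sizes reported in the RESULTS section.
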
_version_ | 1785150481175150592 |
---|---|
author | Saldana, Carlos S; Burkhardt, Elizabeth; Pennisi, Alfred; Oliver, Kirsten; Olmstead, John; Holland, David P; Gettings, Jenna; Wortley, Pascale; Saldana Ochoa, Karla V |
author_sort | Saldana, Carlos S |
collection | PubMed |
description | BACKGROUND: Machine learning (ML) algorithms have predicted incident HIV using electronic medical record (EMR) data. We developed an ML model using de-identified public health data from a high-incidence area to predict incident HIV, which could inform public health interventions such as HIV testing, education, and the scale-up of prevention strategies. METHODS: We used de-identified public health data from Georgia’s State Electronic Notifiable Disease Surveillance System (SendSS) and Enhanced HIV/AIDS Reporting System (eHARS) from 01/2010 to 12/2021 for Fulton County, GA. Included variables are displayed in Table 1. We included males 13 years of age and older. Each patient's HIV status and HIV incidence during the study period were confirmed by matching individuals between the two datasets. We excluded individuals diagnosed with HIV before 2010, those whose first sexually transmitted infection (STI) dataset entry was an HIV diagnosis, and those with more than 10% of variables missing. We matched a Social Vulnerability Index (SVI) to each individual's census tract. We trained various ML classification models to predict incident HIV, using an equal number of HIV-positive and randomly selected HIV-negative observations so that both the training set (85%) and the test set (15%) were balanced. [Figure: see text] RESULTS: Of 85,224 individuals, 1,698 males (2%) were confirmed positive for HIV during the study period and met our inclusion criteria. The training set included 2,896 observations (1,448 HIV+ and 1,448 HIV-) and the test set included 500 observations (250 HIV+ and 250 HIV-). Among the ML models used, Gradient Boosted Trees and Random Forest achieved an accuracy as high as 80% in predicting incident HIV in the test set. The most predictive features were mean age at STI diagnosis, STI diagnosing provider type, STI diagnosis interval, SVI Theme 1, and STI diagnosis. Model performance and evaluation are presented in Figure 1. 
[Figure: see text] Random Forest (RF) and Gradient Boosted Trees (GBT) confusion matrices on a test set of 500 observations. Both the RF and GBT models achieved high accuracy, correctly predicting incident HIV in 202/250 (79%) and 204/250 (80%) individuals, respectively. Precision = number of true positives divided by the total number of positive predictions; Recall = number of true positives divided by the total number of actual positives, that is, the percentage of observations in a class that the model correctly identifies; F1 score = harmonic mean of precision and recall. CONCLUSION: Our ML models predicted incident HIV with up to 80% accuracy on a balanced test set and could be used to customize outreach activities. The approach is distinctive in that it relies strictly on de-identified STI reporting data from public health surveillance, which makes it applicable to a broader population than EMR data. However, more research is needed to implement and evaluate these models in actual public health interventions. DISCLOSURES: All Authors: No reported disclosures |
format | Online Article Text |
id | pubmed-10678962 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10678962 2023-11-27 Open Forum Infect Dis Abstract Oxford University Press 2023-11-27 /pmc/articles/PMC10678962/ http://dx.doi.org/10.1093/ofid/ofad500.168 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of Infectious Diseases Society of America. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | 2897. Development of a Machine Learning Modelling Tool for Predicting Incident HIV Using Public Health Data from a County in the Southern United States |
topic | Abstract |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10678962/ http://dx.doi.org/10.1093/ofid/ofad500.168 |
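
The figure caption in the description field defines precision, recall, and F1 score in words. As a worked illustration, those quantities can be computed from the four cells of a binary confusion matrix roughly as sketched here. The function name `classification_metrics` and the example counts are hypothetical; they are not taken from the study's Figure 1.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard metrics from binary confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Assumes at least one positive prediction and
    one actual positive, so no denominator is zero.
    """
    precision = tp / (tp + fp)  # correct share of positive predictions
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for a 100-observation balanced test set (not study data).
metrics = classification_metrics(tp=45, fp=15, fn=5, tn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of precision and recall, so a model cannot score well on it by maximizing only one of the two.
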