Incorporating metadata in HIV transmission network reconstruction: A machine learning feasibility assessment
HIV molecular epidemiology estimates the transmission patterns from clustering genetically similar viruses. The process involves connecting genetically similar genotyped viral sequences in the network implying epidemiological transmissions. This technique relies on genotype data which is collected o...
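The network construction described above (linking genetically similar sequences and reading clusters off the resulting graph) can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the raw mismatch fraction stands in for a proper evolutionary distance, and the function names, the toy sequences, and the 0.1 threshold are all hypothetical.

```python
from __future__ import annotations

from itertools import combinations


def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_sequences(seqs: dict[str, str], threshold: float) -> list[set[str]]:
    """Link every pair within `threshold`; return the connected components."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if mismatch_fraction(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)  # an edge merges the two components

    clusters: dict[str, set[str]] = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    # Sort by member names so the output order is deterministic.
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical toy alignment: p1-p2 and p2-p3 differ at one site each,
# p4 is distant from everything else.
toy = {
    "p1": "ACGTACGTAC",
    "p2": "ACGTACGTAT",
    "p3": "ACGTACGTTT",
    "p4": "TTTTGGGGCC",
}
print(cluster_sequences(toy, threshold=0.1))
```

Note that p1 and p3 end up in the same cluster even though they are not directly linked, because both are linked to p2; clusters here are connected components, not cliques.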
Main Authors: | Mazrouee, Sepideh; Little, Susan J.; Wertheim, Joel O. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8457453/ https://www.ncbi.nlm.nih.gov/pubmed/34550966 http://dx.doi.org/10.1371/journal.pcbi.1009336 |
_version_ | 1784571099096285184 |
---|---|
author | Mazrouee, Sepideh Little, Susan J. Wertheim, Joel O. |
author_facet | Mazrouee, Sepideh Little, Susan J. Wertheim, Joel O. |
author_sort | Mazrouee, Sepideh |
collection | PubMed |
description | HIV molecular epidemiology estimates transmission patterns by clustering genetically similar viruses. The process involves connecting genetically similar genotyped viral sequences in a network whose links imply epidemiological transmissions. This technique relies on genotype data, which are collected only from HIV-diagnosed, in-care populations, and so leaves many persons with HIV (PWH) who have no access to consistent care out of the tracking process. We use machine learning algorithms to learn the non-linear correlation patterns between patient metadata and transmissions between HIV-positive cases. This enables us to expand the transmission network reconstruction beyond the molecular network. We employed multiple commonly used supervised classification algorithms to analyze the San Diego Primary Infection Resource Consortium (PIRC) cohort dataset, consisting of genotypes and nearly 80 additional non-genetic features. First, we trained classification models to distinguish genetically unrelated individuals from related ones. Our results show that random forest and decision tree achieved over 80% in accuracy, precision, recall, and F1-score using only a subset of meta-features (age, birth sex, sexual orientation, race, transmission category, estimated date of infection, and first viral load date) in addition to genetic data. Additionally, both algorithms achieved approximately 80% sensitivity and specificity. The area under the curve (AUC) was 97% and 94% for the random forest and decision tree classifiers, respectively. Next, we extended the models to identify clusters of similar viral sequences. A support vector machine demonstrated an order-of-magnitude improvement in the accuracy of assigning sequences to the correct cluster compared with a dummy uniform random classifier. These results confirm that metadata carries important information about the dynamics of HIV transmission as embedded in transmission clusters. Hence, novel computational approaches are needed to apply the non-trivial knowledge collected from inter-individual genetic information to metadata from PWH in order to expand the estimated transmissions. We note that feature extraction alone will not be effective in identifying patterns of transmission and will result in random clustering of the data, but its use in conjunction with genetic data and the right algorithm can contribute to the expansion of the reconstructed network beyond individuals with genetic data. |
format | Online Article Text |
id | pubmed-8457453 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-8457453 2021-09-23 Incorporating metadata in HIV transmission network reconstruction: A machine learning feasibility assessment Mazrouee, Sepideh Little, Susan J. Wertheim, Joel O. PLoS Comput Biol Research Article Public Library of Science 2021-09-22 /pmc/articles/PMC8457453/ /pubmed/34550966 http://dx.doi.org/10.1371/journal.pcbi.1009336 Text en © 2021 Mazrouee et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Mazrouee, Sepideh Little, Susan J. Wertheim, Joel O. Incorporating metadata in HIV transmission network reconstruction: A machine learning feasibility assessment |
title | Incorporating metadata in HIV transmission network reconstruction: A machine learning feasibility assessment |
title_full | Incorporating metadata in HIV transmission network reconstruction: A machine learning feasibility assessment |
title_fullStr | Incorporating metadata in HIV transmission network reconstruction: A machine learning feasibility assessment |
title_full_unstemmed | Incorporating metadata in HIV transmission network reconstruction: A machine learning feasibility assessment |
title_short | Incorporating metadata in HIV transmission network reconstruction: A machine learning feasibility assessment |
title_sort | incorporating metadata in hiv transmission network reconstruction: a machine learning feasibility assessment |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8457453/ https://www.ncbi.nlm.nih.gov/pubmed/34550966 http://dx.doi.org/10.1371/journal.pcbi.1009336 |
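
The description field reports accuracy, precision, recall, F1-score, sensitivity, and specificity for the related/unrelated classifiers, and compares cluster assignment against a dummy uniform random classifier. The sketch below shows how those quantities are conventionally computed from predictions. The labels are made-up toy values, not PIRC data, and the function names are hypothetical. The record does not say how the baseline was implemented; the `1 / n_clusters` figure is simply the expected accuracy of guessing uniformly among `n_clusters` labels.

```python
def binary_metrics(y_true, y_pred):
    """Standard confusion-matrix metrics for labels coded 1 (related) / 0 (unrelated)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # recall is the same as sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


def uniform_baseline_accuracy(n_clusters):
    """Expected accuracy of guessing one of n_clusters labels uniformly at random."""
    return 1.0 / n_clusters


# Hypothetical toy labels: 4 related pairs, 6 unrelated pairs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(binary_metrics(y_true, y_pred))

# With, say, 50 clusters (an assumed number), uniform guessing is right 2% of
# the time, so a tenfold improvement would correspond to roughly 20% accuracy.
print(uniform_baseline_accuracy(50))
```

Because the uniform baseline shrinks as the number of clusters grows, an order-of-magnitude gain over it can still correspond to a modest absolute accuracy, which is worth keeping in mind when reading the cluster-assignment result.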