Search Term Identification Methods for Computational Health Communication: Word Embedding and Network Approach for Health Content on YouTube

Bibliographic Details
Main authors: Tong, Chau; Margolin, Drew; Chunara, Rumi; Niederdeppe, Jeff; Taylor, Teairah; Dunbar, Natalie; King, Andy J
Format: Online Article Text
Language: English
Published: JMIR Publications 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9472050/
https://www.ncbi.nlm.nih.gov/pubmed/36040760
http://dx.doi.org/10.2196/37862
_version_ 1784789222042894336
author Tong, Chau
Margolin, Drew
Chunara, Rumi
Niederdeppe, Jeff
Taylor, Teairah
Dunbar, Natalie
King, Andy J
collection PubMed
description BACKGROUND: Common methods for extracting content in health communication research typically involve using a set of well-established queries, usually names of medical procedures or diseases, that are often technical or rarely used in the public discussion of health topics. Although these methods produce high precision (ie, retrieve highly relevant content), they tend to overlook health messages that feature colloquial language and layperson vocabularies on social media. Given how such messages could contain misinformation or obscure content that circumvents official medical concepts, correctly identifying (and analyzing) them is crucial to the study of user-generated health content on social media platforms.
OBJECTIVE: Health communication scholars would benefit from a retrieval process that goes beyond the use of standard terminologies as search queries. Motivated by this need, this study aims to put forward a search term identification method to improve the retrieval of user-generated health content on social media. We focused on cancer screening tests as a subject and YouTube as a platform case study.
METHODS: We retrieved YouTube videos using cancer screening procedures (colonoscopy, fecal occult blood test, mammogram, and pap test) as seed queries. We then trained word embedding models using text features from these videos to identify the nearest neighbor terms that are semantically similar to cancer screening tests in colloquial language. Retrieving more YouTube videos from the top neighbor terms, we coded a sample of 150 random videos from each term for relevance. We then used text mining to examine the new content retrieved from these videos and network analysis to inspect the relations between the newly retrieved videos and videos from the seed queries.
RESULTS: The top terms with semantic similarities to cancer screening tests were identified via word embedding models. Text mining analysis showed that the 5 nearest neighbor terms retrieved content that was novel and contextually diverse, beyond the content retrieved from cancer screening concepts alone. Results from network analysis showed that the newly retrieved videos had a total degree (sum of indegree and outdegree) of at least 1 with seed videos according to YouTube relatedness measures.
CONCLUSIONS: We demonstrated a retrieval technique to improve recall and minimize precision loss, which can be extended to various health topics on YouTube, a popular video-sharing social media platform. We discussed how health communication scholars can apply the technique to inspect the performance of the retrieval strategy before investing human coding resources and outlined suggestions on how such a technique can be extended to other health contexts.
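The two computational steps named in the abstract, nearest neighbor lookup in an embedding space and total degree in a directed relatedness network, can be illustrated with a minimal Python sketch. This is not the authors' code: the vectors, terms, and edges below are invented toy data, the function names are hypothetical, and cosine similarity is assumed as the similarity measure because it is the usual choice for word embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(seed, vectors, k=5):
    """Return the k terms whose vectors are most similar to the seed term."""
    scores = [
        (term, cosine(vectors[seed], vec))
        for term, vec in vectors.items()
        if term != seed
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in scores[:k]]


def total_degree(edges, node):
    """Sum of indegree and outdegree of a node in a directed edge list."""
    indegree = sum(1 for _, dst in edges if dst == node)
    outdegree = sum(1 for src, _ in edges if src == node)
    return indegree + outdegree


# Toy 3-dimensional "embeddings" (invented; real models are trained on video text).
toy_vectors = {
    "colonoscopy": [0.9, 0.1, 0.0],
    "prep": [0.8, 0.2, 0.1],
    "scope": [0.7, 0.3, 0.0],
    "recipe": [0.0, 0.1, 0.9],
}

# Toy directed "related video" edges (invented): source recommends destination.
toy_edges = [("seed_video", "new_video"), ("new_video", "other_seed")]

print(nearest_neighbors("colonoscopy", toy_vectors, k=2))
print(total_degree(toy_edges, "new_video"))
```

In this toy setup, a newly retrieved video counts as connected to the seed set when its total degree is at least 1, which mirrors the criterion reported in the RESULTS section.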
format Online
Article
Text
id pubmed-9472050
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-9472050 2022-09-15 JMIR Med Inform Original Paper JMIR Publications 2022-08-30 /pmc/articles/PMC9472050/ /pubmed/36040760 http://dx.doi.org/10.2196/37862 Text en
©Chau Tong, Drew Margolin, Rumi Chunara, Jeff Niederdeppe, Teairah Taylor, Natalie Dunbar, Andy J King. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 30.08.2022.
https://creativecommons.org/licenses/by/4.0/
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
title Search Term Identification Methods for Computational Health Communication: Word Embedding and Network Approach for Health Content on YouTube
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9472050/
https://www.ncbi.nlm.nih.gov/pubmed/36040760
http://dx.doi.org/10.2196/37862