Search Term Identification Methods for Computational Health Communication: Word Embedding and Network Approach for Health Content on YouTube
BACKGROUND: Common methods for extracting content in health communication research typically involve using a set of well-established queries, often names of medical procedures or diseases, that are often technical or rarely used in the public discussion of health topics. Although these methods produ...
Main authors: | Tong, Chau; Margolin, Drew; Chunara, Rumi; Niederdeppe, Jeff; Taylor, Teairah; Dunbar, Natalie; King, Andy J |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9472050/ https://www.ncbi.nlm.nih.gov/pubmed/36040760 http://dx.doi.org/10.2196/37862 |
_version_ | 1784789222042894336 |
---|---|
author | Tong, Chau Margolin, Drew Chunara, Rumi Niederdeppe, Jeff Taylor, Teairah Dunbar, Natalie King, Andy J |
author_sort | Tong, Chau |
collection | PubMed |
description | BACKGROUND: Common methods for extracting content in health communication research typically involve using a set of well-established queries, often names of medical procedures or diseases, that are often technical or rarely used in the public discussion of health topics. Although these methods produce high recall (ie, retrieve highly relevant content), they tend to overlook health messages that feature colloquial language and layperson vocabularies on social media. Given how such messages could contain misinformation or obscure content that circumvents official medical concepts, correctly identifying (and analyzing) them is crucial to the study of user-generated health content on social media platforms. OBJECTIVE: Health communication scholars would benefit from a retrieval process that goes beyond the use of standard terminologies as search queries. Motivated by this, this study aims to put forward a search term identification method to improve the retrieval of user-generated health content on social media. We focused on cancer screening tests as a subject and YouTube as a platform case study. METHODS: We retrieved YouTube videos using cancer screening procedures (colonoscopy, fecal occult blood test, mammogram, and pap test) as seed queries. We then trained word embedding models using text features from these videos to identify the nearest neighbor terms that are semantically similar to cancer screening tests in colloquial language. Retrieving more YouTube videos from the top neighbor terms, we coded a sample of 150 random videos from each term for relevance. We then used text mining to examine the new content retrieved from these videos and network analysis to inspect the relations between the newly retrieved videos and videos from the seed queries. RESULTS: The top terms with semantic similarities to cancer screening tests were identified via word embedding models. 
Text mining analysis showed that the 5 nearest neighbor terms retrieved content that was novel and contextually diverse, beyond the content retrieved from cancer screening concepts alone. Results from network analysis showed that the newly retrieved videos had at least one total degree of connection (sum of indegree and outdegree) with seed videos according to YouTube relatedness measures. CONCLUSIONS: We demonstrated a retrieval technique to improve recall and minimize precision loss, which can be extended to various health topics on YouTube, a popular video-sharing social media platform. We discussed how health communication scholars can apply the technique to inspect the performance of the retrieval strategy before investing human coding resources and outlined suggestions on how such a technique can be extended to other health contexts. |
format | Online Article Text |
id | pubmed-9472050 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-9472050 2022-09-15 Search Term Identification Methods for Computational Health Communication: Word Embedding and Network Approach for Health Content on YouTube Tong, Chau Margolin, Drew Chunara, Rumi Niederdeppe, Jeff Taylor, Teairah Dunbar, Natalie King, Andy J JMIR Med Inform Original Paper JMIR Publications 2022-08-30 /pmc/articles/PMC9472050/ /pubmed/36040760 http://dx.doi.org/10.2196/37862 Text en ©Chau Tong, Drew Margolin, Rumi Chunara, Jeff Niederdeppe, Teairah Taylor, Natalie Dunbar, Andy J King. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 30.08.2022. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included. |
title | Search Term Identification Methods for Computational Health Communication: Word Embedding and Network Approach for Health Content on YouTube |
topic | Original Paper |
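
Illustrative note: the abstract above describes two computational steps, finding nearest neighbor terms of a seed query in a word embedding space and checking each newly retrieved video's total degree (indegree plus outdegree) with seed videos. The sketch below shows one plausible reading of those steps on toy data. It is not the authors' code: the function names, the use of cosine similarity as the neighbor measure, the example vectors, and the edge-list representation of YouTube relatedness are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(embeddings, seed, k=5):
    """Return the k terms most similar to `seed`, as (term, similarity) pairs.

    `embeddings` is a hypothetical dict mapping each term to its vector.
    """
    seed_vec = embeddings[seed]
    scored = [
        (term, cosine(seed_vec, vec))
        for term, vec in embeddings.items()
        if term != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def total_degree_with_seeds(edges, seed_videos):
    """Count, for each non-seed video, the directed edges linking it to seeds.

    `edges` is a hypothetical list of (source, target) relatedness links.
    An edge seed -> video adds to that video's indegree from seeds; an edge
    video -> seed adds to its outdegree toward seeds. The returned Counter
    holds the sum of both, and yields 0 for videos with no seed link.
    """
    degree = Counter()
    for src, dst in edges:
        if src in seed_videos and dst not in seed_videos:
            degree[dst] += 1
        elif dst in seed_videos and src not in seed_videos:
            degree[src] += 1
    return degree


# Toy data, invented for illustration only.
toy_embeddings = {
    "colonoscopy": [0.9, 0.1, 0.0],
    "colon_prep": [0.8, 0.2, 0.1],
    "mammogram": [0.1, 0.9, 0.0],
    "recipe": [0.0, 0.1, 0.9],
}
toy_edges = [
    ("seed1", "new1"),
    ("new1", "seed2"),
    ("new2", "new1"),
    ("seed1", "seed2"),
]

neighbors = nearest_neighbors(toy_embeddings, "colonoscopy", k=2)
degrees = total_degree_with_seeds(toy_edges, {"seed1", "seed2"})
```

In practice the term vectors would come from an embedding model trained on the video text features, and the edges from whatever relatedness data the platform exposes; neither source is specified in this record, so both are stand-ins here.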