
Multimodal detection of hateful memes by applying a vision-language pre-training model


Bibliographic Details
Main Authors: Chen, Yuyang; Pan, Feng
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9467312/
https://www.ncbi.nlm.nih.gov/pubmed/36095029
http://dx.doi.org/10.1371/journal.pone.0274300
_version_ 1784788165086674944
author Chen, Yuyang
Pan, Feng
author_facet Chen, Yuyang
Pan, Feng
author_sort Chen, Yuyang
collection PubMed
description Detrimental to individuals and society, online hateful messages have recently become a major social issue. Among them, a new type of hateful message, the “hateful meme”, has emerged and poses difficulties for traditional deep learning-based detection. Because hateful memes combine text captions and images to express users’ intents, they cannot be accurately identified by analyzing the embedded text captions or the images alone. To detect hateful memes effectively, an algorithm must therefore possess strong vision-language fusion capability. In this study, we move closer to this goal by stacking each meme’s visual features and object tags, produced by the VinVL (Visual features in Vision-Language) object detection model, with its text features, extracted by optical character recognition (OCR), into a triplet, and feeding this triplet into OSCAR+, a Transformer-based vision-language pre-training model (VL-PTM), to perform cross-modal learning on memes. After fine-tuning and attaching a random forest (RF) classifier, our model (OSCAR+RF) achieved an average accuracy of 0.684 and an AUROC of 0.768 on the hateful meme detection task on a public test set, outperforming the other eleven (11) published baselines. In conclusion, this study demonstrates that VL-PTMs with added anchor points can improve deep learning-based detection of hateful memes by enforcing a stronger alignment between the text caption and the visual information.
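The description above outlines the pipeline at a high level: OCR-extracted caption text, VinVL object tags, and VinVL region features are combined into a triplet, encoded by the OSCAR+ vision-language transformer, and the resulting representation is classified by a random forest. The snippet below is a minimal, hypothetical sketch of that shape using off-the-shelf substitutes; it is not the authors' code. In particular, bert-base-uncased stands in for OSCAR+, the region features are mean-pooled and concatenated rather than jointly attended over, and the memes are toy placeholders.

```python
# Hypothetical sketch of the pipeline summarized in the description field above.
# Assumptions (not from the paper): bert-base-uncased replaces OSCAR+, region
# features are mean-pooled instead of cross-attended, and the memes are toy data.
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()


def embed_meme(caption: str, object_tags: list, region_feats: np.ndarray) -> np.ndarray:
    """OSCAR-style triplet: the OCR caption and the detected object tags are
    encoded as a sentence pair; the visual region features are appended to the
    pooled [CLS] vector (a crude stand-in for OSCAR+'s joint attention)."""
    inputs = tokenizer(caption, " ".join(object_tags),
                       return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        pooled = encoder(**inputs).last_hidden_state[:, 0, :].squeeze(0).numpy()
    return np.concatenate([pooled, region_feats.mean(axis=0)])


# Toy stand-ins for memes: (OCR caption, VinVL-style tags, region features, label).
rng = np.random.default_rng(0)
memes = [
    ("love your neighbour", ["person", "sign"], rng.normal(size=(5, 2048)), 0),
    ("have a great day", ["dog", "grass"], rng.normal(size=(5, 2048)), 0),
    ("you people ruin everything", ["person", "flag"], rng.normal(size=(5, 2048)), 1),
    ("nobody wants you here", ["person", "door"], rng.normal(size=(5, 2048)), 1),
]

X = np.stack([embed_meme(c, t, f) for c, t, f, _ in memes])
y = np.array([label for *_, label in memes])

# The paper attaches a random forest classifier on top of the encoder's output.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
scores = clf.predict_proba(X)[:, 1]
print("training AUROC:", roc_auc_score(y, scores))
```

In the system the abstract describes, OSCAR+ attends jointly over word embeddings, object-tag embeddings, and region features, and the random forest sits on top of the fine-tuned encoder in place of a linear classification head; the sketch only mirrors that overall triplet-to-RF structure.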
format Online
Article
Text
id pubmed-9467312
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9467312 2022-09-13 Multimodal detection of hateful memes by applying a vision-language pre-training model Chen, Yuyang Pan, Feng PLoS One Research Article Detrimental to individuals and society, online hateful messages have recently become a major social issue. Among them, a new type of hateful message, the “hateful meme”, has emerged and poses difficulties for traditional deep learning-based detection. Because hateful memes combine text captions and images to express users’ intents, they cannot be accurately identified by analyzing the embedded text captions or the images alone. To detect hateful memes effectively, an algorithm must therefore possess strong vision-language fusion capability. In this study, we move closer to this goal by stacking each meme’s visual features and object tags, produced by the VinVL (Visual features in Vision-Language) object detection model, with its text features, extracted by optical character recognition (OCR), into a triplet, and feeding this triplet into OSCAR+, a Transformer-based vision-language pre-training model (VL-PTM), to perform cross-modal learning on memes. After fine-tuning and attaching a random forest (RF) classifier, our model (OSCAR+RF) achieved an average accuracy of 0.684 and an AUROC of 0.768 on the hateful meme detection task on a public test set, outperforming the other eleven (11) published baselines. In conclusion, this study demonstrates that VL-PTMs with added anchor points can improve deep learning-based detection of hateful memes by enforcing a stronger alignment between the text caption and the visual information. Public Library of Science 2022-09-12 /pmc/articles/PMC9467312/ /pubmed/36095029 http://dx.doi.org/10.1371/journal.pone.0274300 Text en © 2022 Chen, Pan https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Chen, Yuyang
Pan, Feng
Multimodal detection of hateful memes by applying a vision-language pre-training model
title Multimodal detection of hateful memes by applying a vision-language pre-training model
title_full Multimodal detection of hateful memes by applying a vision-language pre-training model
title_fullStr Multimodal detection of hateful memes by applying a vision-language pre-training model
title_full_unstemmed Multimodal detection of hateful memes by applying a vision-language pre-training model
title_short Multimodal detection of hateful memes by applying a vision-language pre-training model
title_sort multimodal detection of hateful memes by applying a vision-language pre-training model
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9467312/
https://www.ncbi.nlm.nih.gov/pubmed/36095029
http://dx.doi.org/10.1371/journal.pone.0274300
work_keys_str_mv AT chenyuyang multimodaldetectionofhatefulmemesbyapplyingavisionlanguagepretrainingmodel
AT panfeng multimodaldetectionofhatefulmemesbyapplyingavisionlanguagepretrainingmodel