Paper Title
Hateful Memes Detection via Complementary Visual and Linguistic Networks
Paper Authors
Paper Abstract
Hateful memes are widespread on social media and convey negative information. The main challenge of hateful memes detection is that the expressive meaning cannot be well recognized by a single modality. To further integrate modal information, we investigate a candidate solution based on complementary visual and linguistic networks for the Hateful Memes Challenge 2020, so that more comprehensive multi-modal information can be explored in detail. Both contextual-level and sensitive object-level information are considered in the visual and linguistic embeddings to model the complex multi-modal scenarios. Specifically, a pre-trained classifier and an object detector are utilized to obtain contextual features and regions of interest (RoIs) from the input, followed by position-representation fusion to form the visual embedding. The linguistic embedding is composed of three components: the sentence word embedding, the position embedding, and the corresponding spaCy embedding (Sembedding), a symbolic representation built from vocabulary extracted by spaCy. Both visual and linguistic embeddings are fed into the designed Complementary Visual and Linguistic (CVL) networks to produce predictions for hateful memes. Experimental results on the Hateful Memes Challenge dataset demonstrate that CVL provides decent performance, achieving 78.48% AUROC and 72.95% accuracy. Code is available at https://github.com/webYFDT/hateful.
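The abstract does not spell out implementation details, so the following is a minimal PyTorch sketch of how the described pieces could be wired together: a linguistic embedding that sums word, position, and spaCy-derived symbol embeddings; a visual embedding that fuses a contextual feature with RoI features plus box-position encodings; and a joint network over both token streams. All module names, dimensions, and the choice of a Transformer encoder for fusion are assumptions for illustration; the authors' actual CVL implementation lives in the linked repository.

```python
import torch
import torch.nn as nn

class LinguisticEmbedding(nn.Module):
    """Word + position + spaCy-symbol embeddings, summed per token.
    (The summation and the symbol-to-id mapping are assumptions.)"""
    def __init__(self, vocab_size, sym_vocab_size, max_len, dim):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        self.sym = nn.Embedding(sym_vocab_size, dim)  # spaCy-extracted vocabulary symbols

    def forward(self, token_ids, sym_ids):
        # token_ids, sym_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        return self.word(token_ids) + self.pos(positions) + self.sym(sym_ids)

class VisualEmbedding(nn.Module):
    """Projects the contextual feature (pre-trained classifier) and RoI features
    (object detector) into a shared space, adding a box-position encoding."""
    def __init__(self, ctx_dim, roi_dim, dim):
        super().__init__()
        self.ctx_proj = nn.Linear(ctx_dim, dim)
        self.roi_proj = nn.Linear(roi_dim, dim)
        self.box_proj = nn.Linear(4, dim)  # normalized (x1, y1, x2, y2) box coordinates

    def forward(self, ctx_feat, roi_feats, boxes):
        ctx = self.ctx_proj(ctx_feat).unsqueeze(1)               # (batch, 1, dim)
        rois = self.roi_proj(roi_feats) + self.box_proj(boxes)   # (batch, n_rois, dim)
        return torch.cat([ctx, rois], dim=1)

class CVLNet(nn.Module):
    """Joint encoder over concatenated visual and linguistic tokens,
    pooled into a binary hateful/not-hateful prediction."""
    def __init__(self, dim, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls = nn.Linear(dim, 2)

    def forward(self, vis_tokens, lang_tokens):
        x = torch.cat([vis_tokens, lang_tokens], dim=1)
        pooled = self.encoder(x).mean(dim=1)  # mean-pool over all tokens
        return self.cls(pooled)               # logits for the two classes
```

A usage pass would build `vis_tokens = VisualEmbedding(...)(ctx_feat, roi_feats, boxes)` and `lang_tokens = LinguisticEmbedding(...)(token_ids, sym_ids)`, then call `CVLNet(dim)(vis_tokens, lang_tokens)` and train with cross-entropy; AUROC is computed from the softmax probability of the hateful class.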