Paper Title
Improving Generalizability in Implicitly Abusive Language Detection with Concept Activation Vectors
Paper Authors
Abstract
Robustness of machine learning models on ever-changing real-world data is critical, especially for applications affecting human well-being such as content moderation. New kinds of abusive language continually emerge in online discussions in response to current events (e.g., COVID-19), and the deployed abuse detection systems should be updated regularly to remain accurate. In this paper, we show that general abusive language classifiers tend to be fairly reliable in detecting out-of-domain explicitly abusive utterances but fail to detect new types of more subtle, implicit abuse. Next, we propose an interpretability technique, based on the Testing Concept Activation Vector (TCAV) method from computer vision, to quantify the sensitivity of a trained model to the human-defined concepts of explicit and implicit abusive language, and use that to explain the generalizability of the model on new data, in this case, COVID-related anti-Asian hate speech. Extending this technique, we introduce a novel metric, Degree of Explicitness, for a single instance and show that the new metric is beneficial in suggesting out-of-domain unlabeled examples to effectively enrich the training data with informative, implicitly abusive texts.
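The TCAV procedure the abstract builds on can be sketched as follows. This is a minimal, illustrative reconstruction, not the paper's implementation: the toy "activations" and "gradients" are random stand-ins for a real model's layer outputs, and `train_cav` is a hypothetical helper. The core idea is faithful to TCAV: fit a linear separator between activations of concept examples (e.g., explicitly abusive texts) and random counterexamples, take its weight vector as the concept activation vector (CAV), then score the model's sensitivity to the concept as the fraction of inputs whose class-gradient has a positive directional derivative along the CAV.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for layer activations: concept examples cluster along the
# first axis; random counterexamples do not. In the real method these would
# come from a trained classifier's hidden layer.
concept_acts = rng.normal(0.0, 0.1, (50, 8)) + np.array([1.0] + [0.0] * 7)
random_acts = rng.normal(0.0, 0.1, (50, 8))

def train_cav(pos, neg, lr=0.5, steps=500):
    """Fit a logistic-regression separator with plain gradient descent;
    the normalized weight vector is the concept activation vector (CAV)."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient step on weights
        b -= lr * np.mean(p - y)                 # gradient step on bias
    return w / np.linalg.norm(w)

cav = train_cav(concept_acts, random_acts)

# TCAV score: fraction of inputs whose class-logit gradient (w.r.t. the
# activations) points in the CAV's direction. Gradients here are synthetic,
# biased along the CAV purely for illustration.
grads = rng.normal(0.0, 0.1, (100, 8)) + 0.5 * cav
tcav_score = float(np.mean(grads @ cav > 0))
```

A score near 1 means the model's prediction for the class is consistently sensitive to the concept; the paper uses such scores to compare a classifier's reliance on explicit versus implicit abuse concepts.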