Paper Title
Toxicity Detection with Generative Prompt-based Inference
Paper Authors
Paper Abstract
Due to its subtlety, implicitness, and the different interpretations perceived by different people, detecting undesirable content in text is a nuanced and difficult task. It is a long-known risk that language models (LMs), once trained on corpora containing undesirable content, can manifest biases and toxicity. However, recent studies suggest that, as a remedy, LMs are also capable of identifying toxic content without additional fine-tuning. Prompt-based methods have been shown to effectively harvest this surprising self-diagnosing capability. Existing prompt-based methods, however, usually specify the instruction to a language model in a discriminative way. In this work, we explore the generative variant of zero-shot prompt-based toxicity detection with comprehensive trials on prompt engineering. We evaluate our approach on three datasets of social media posts annotated with toxicity labels. Our analysis highlights the strengths of our generative classification approach both quantitatively and qualitatively. Interesting aspects of self-diagnosis and its ethical implications are also discussed.
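The abstract's central contrast between discriminative and generative prompting can be illustrated with a minimal sketch. The prompt templates and the scoring stub below are illustrative assumptions for exposition only; they are not taken from the paper, and a real system would replace `score` with an actual LM log-likelihood.

```python
# Illustrative sketch: discriminative vs. generative prompt-based
# toxicity detection. Templates and scoring are assumptions, not the
# paper's actual prompts.

def discriminative_prompt(text: str) -> str:
    # Discriminative style: ask a yes/no question and read off the
    # model's answer token.
    return (
        "Does the following text contain toxic language? "
        f'Answer yes or no.\nText: "{text}"\nAnswer:'
    )

def generative_prompts(text: str) -> dict:
    # Generative style: phrase each label as a natural-language
    # statement and compare how likely the LM finds each one.
    return {
        label: f'The following text is {label}: "{text}"'
        for label in ("toxic", "non-toxic")
    }

def classify(text: str, score) -> str:
    # `score` stands in for an LM log-likelihood function; the label
    # whose statement the model finds most probable wins.
    candidates = generative_prompts(text)
    return max(candidates, key=lambda label: score(candidates[label]))
```

In the generative variant, classification reduces to comparing sequence likelihoods rather than parsing a constrained answer, which is the distinction the abstract draws.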