Paper Title
A Keyword Based Approach to Understanding the Overpenalization of Marginalized Groups by English Marginal Abuse Models on Twitter
Paper Authors
Paper Abstract
Harmful content detection models tend to have higher false positive rates for content from marginalized groups. In the context of marginal abuse modeling on Twitter, such disproportionate penalization poses the risk of reduced visibility, where marginalized communities lose the opportunity to voice their opinion on the platform. Current approaches to algorithmic harm mitigation and bias detection for NLP models are often ad hoc and subject to human bias. We make two main contributions in this paper. First, we design a novel methodology, which provides a principled approach to detecting and measuring the severity of potential harms associated with a text-based model. Second, we apply our methodology to audit Twitter's English marginal abuse model, which is used for removing amplification eligibility of marginally abusive content. Without utilizing demographic labels or dialect classifiers, we are still able to detect and measure the severity of issues related to the over-penalization of the speech of marginalized communities, such as the use of reclaimed speech, counterspeech, and identity-related terms. In order to mitigate the associated harms, we experiment with adding additional true negative examples and find that doing so provides improvements to our fairness metrics without large degradations in model performance.
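To make the keyword-based audit concrete: the core measurement compares the model's false positive rate (FPR) on benign tweets containing a given keyword against its baseline FPR over all benign tweets. The sketch below illustrates that comparison only; the function names and toy data are hypothetical and not the paper's actual implementation.

```python
# Minimal sketch of a keyword-based FPR audit, assuming labeled triples of
# (text, true_label, predicted_label) where label 1 = abusive. All names
# and data here are illustrative assumptions, not Twitter's pipeline.

def false_positive_rate(examples):
    """FPR = FP / (FP + TN), computed over truly benign examples."""
    negatives = [e for e in examples if e[1] == 0]
    if not negatives:
        return 0.0
    false_positives = sum(1 for e in negatives if e[2] == 1)
    return false_positives / len(negatives)

def keyword_fpr_gaps(examples, keywords):
    """For each keyword, how much the FPR on benign texts containing
    that keyword exceeds the overall (baseline) FPR."""
    baseline = false_positive_rate(examples)
    gaps = {}
    for kw in keywords:
        subset = [e for e in examples if kw in e[0].lower()]
        if subset:
            gaps[kw] = false_positive_rate(subset) - baseline
    return gaps

if __name__ == "__main__":
    # Toy data: (text, true_label, model_prediction)
    data = [
        ("benign tweet using reclaimed speech", 0, 1),
        ("ordinary benign tweet", 0, 0),
        ("genuinely abusive tweet", 1, 1),
        ("benign tweet with an identity term", 0, 1),
        ("another ordinary benign tweet", 0, 0),
    ]
    print(keyword_fpr_gaps(data, ["reclaimed", "identity"]))
```

Keywords with large positive gaps flag candidate over-penalization issues (e.g., reclaimed speech or identity-related terms), which is also where the mitigation described in the abstract, adding true negative training examples containing those keywords, would plausibly be targeted.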