Paper Title
How Gender Debiasing Affects Internal Model Representations, and Why It Matters
Paper Authors
Paper Abstract
Common studies of gender bias in NLP focus either on extrinsic bias measured by model performance on a downstream task or on intrinsic bias found in models' internal representations. However, the relationship between extrinsic and intrinsic bias is relatively unknown. In this work, we illuminate this relationship by measuring both quantities together: we debias a model during downstream fine-tuning, which reduces extrinsic bias, and measure the effect on intrinsic bias, which is operationalized as bias extractability with information-theoretic probing. Through experiments on two tasks and multiple bias metrics, we show that our intrinsic bias metric is a better indicator of debiasing than (a contextual adaptation of) the standard WEAT metric, and can also expose cases of superficial debiasing. Our framework provides a comprehensive perspective on bias in NLP models, which can be applied to deploy NLP systems in a more informed manner. Our code and model checkpoints are publicly available.
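For background (this note is not part of the paper's abstract): the WEAT metric referenced above is the Word Embedding Association Test of Caliskan et al. (2017). It compares two sets of target word embeddings $X$ and $Y$ against two sets of attribute embeddings $A$ and $B$. A standard sketch of its per-word association and effect size, under the usual cosine-similarity formulation, is:

$$ s(w, A, B) = \operatorname*{mean}_{a \in A} \cos(\vec{w}, \vec{a}) \;-\; \operatorname*{mean}_{b \in B} \cos(\vec{w}, \vec{b}) $$

$$ d = \frac{\operatorname*{mean}_{x \in X} s(x, A, B) \;-\; \operatorname*{mean}_{y \in Y} s(y, A, B)}{\operatorname*{std}_{w \in X \cup Y} s(w, A, B)} $$

The abstract's "contextual adaptation" applies this statistic to contextualized representations rather than static word vectors; the intrinsic metric proposed in the paper instead measures how easily gender information can be extracted from the representations via information-theoretic probing.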