Paper Title

Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions

Paper Authors

Mihir Parmar, Swaroop Mishra, Mor Geva, Chitta Baral

Paper Abstract

In recent years, progress in NLU has been driven by benchmarks. These benchmarks are typically collected by crowdsourcing, where annotators write examples based on annotation instructions crafted by dataset creators. In this work, we hypothesize that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write many similar examples that are then over-represented in the collected data. We study this form of bias, termed instruction bias, in 14 recent NLU benchmarks, showing that instruction examples often exhibit concrete patterns, which are propagated by crowdworkers to the collected data. This extends previous work (Geva et al., 2019) and raises a new concern of whether we are modeling the dataset creator's instructions, rather than the task. Through a series of experiments, we show that, indeed, instruction bias can lead to overestimation of model performance, and that models struggle to generalize beyond biases originating in the crowdsourcing instructions. We further analyze the influence of instruction bias in terms of pattern frequency and model size, and derive concrete recommendations for creating future NLU benchmarks.
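The core measurement behind the abstract is how often patterns appearing in the instruction examples recur in the collected data. The snippet below is a minimal illustrative sketch of that idea only; the pattern and the example questions are invented for illustration and are not taken from the paper or from any of the 14 benchmarks it studies.

import re

# Illustrative only: a question pattern of the kind that might appear in
# annotation instruction examples, plus a tiny invented sample of collected
# examples. Neither comes from the paper or the benchmarks it analyzes.
instruction_pattern = re.compile(r"^how many .+\?$", re.IGNORECASE)

collected_examples = [
    "How many touchdowns were scored in the first quarter?",
    "How many field goals did the kicker make?",
    "Which team scored last?",
    "How many yards was the longest pass?",
]

# Count how often the instruction-example pattern recurs in the collected
# data; a high share would suggest annotators propagated the pattern.
matches = sum(bool(instruction_pattern.match(ex)) for ex in collected_examples)
print(f"Pattern frequency: {matches}/{len(collected_examples)} "
      f"({matches / len(collected_examples):.0%})")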
