Paper Title

Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles

Paper Authors

Christopher Clark, Mark Yatskar, Luke Zettlemoyer

Paper Abstract

Many datasets have been shown to contain incidental correlations created by idiosyncrasies in the data collection process. For example, sentence entailment datasets can have spurious word-class correlations if nearly all contradiction sentences contain the word "not", and image recognition datasets can have tell-tale object-background correlations if dogs are always indoors. In this paper, we propose a method that can automatically detect and ignore these kinds of dataset-specific patterns, which we call dataset biases. Our method trains a lower capacity model in an ensemble with a higher capacity model. During training, the lower capacity model learns to capture relatively shallow correlations, which we hypothesize are likely to reflect dataset bias. This frees the higher capacity model to focus on patterns that should generalize better. We ensure the models learn non-overlapping approaches by introducing a novel method to make them conditionally independent. Importantly, our approach does not require the bias to be known in advance. We evaluate performance on synthetic datasets, and four datasets built to penalize models that exploit known biases on textual entailment, visual question answering, and image recognition tasks. We show improvement in all settings, including a 10 point gain on the visual question answering dataset.
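The training setup the abstract describes can be made concrete with a short sketch. Below is a minimal PyTorch-style illustration, not the paper's exact implementation: the model architectures, input dimensions, and the function names (`lower_capacity`, `higher_capacity`, `train_step`, `predict`) are assumptions for illustration, the combination shown is a plain product of experts, and the paper's novel conditional-independence term is omitted. It captures the core mechanic: the two models are trained jointly through a combined prediction, and only the higher capacity model is kept at test time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the two ensemble members. In the paper, the
# higher capacity model is a full task model (e.g. a VQA network) and the
# lower capacity model is a much smaller network over the same inputs.
lower_capacity = nn.Sequential(nn.Linear(256, 16), nn.ReLU(), nn.Linear(16, 3))
higher_capacity = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 3))

optimizer = torch.optim.Adam(
    list(lower_capacity.parameters()) + list(higher_capacity.parameters())
)

def train_step(x, y):
    """One joint training step on a batch (x: features, y: class labels)."""
    log_p_low = F.log_softmax(lower_capacity(x), dim=-1)
    log_p_high = F.log_softmax(higher_capacity(x), dim=-1)

    # Product-of-experts combination: summing log-probabilities multiplies
    # the two distributions. Shallow correlations the small model captures
    # reduce the gradient pressure on the large model to learn them.
    combined = log_p_low + log_p_high  # re-normalized inside cross_entropy

    loss = F.cross_entropy(combined, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def predict(x):
    """At test time the lower capacity model is dropped, so whatever
    bias-like patterns it absorbed are ignored."""
    with torch.no_grad():
        return higher_capacity(x).argmax(dim=-1)
```

Dropping the lower capacity model at inference is what "ignoring" the bias means in practice: any dataset-specific shortcut it soaked up during joint training simply never contributes to test predictions.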
