Paper Title
On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law
Paper Authors
Paper Abstract
Out-of-distribution (OOD) testing is increasingly popular for evaluating a machine learning system's ability to generalize beyond the biases of a training set. OOD benchmarks are designed to present a different joint distribution of data and labels between training and test time. VQA-CP has become the standard OOD benchmark for visual question answering, but we discovered three troubling practices in its current use. First, most published methods rely on explicit knowledge of the construction of the OOD splits. They often rely on "inverting" the distribution of labels, e.g., answering mostly 'yes' when the common training answer is 'no'. Second, the OOD test set is used for model selection. Third, a model's in-domain performance is assessed after retraining it on in-domain splits (VQA v2) that exhibit a more balanced distribution of labels. These three practices defeat the objective of evaluating generalization and call into question the value of methods specifically designed for this dataset. We show that embarrassingly simple methods, including one that generates answers at random, surpass the state of the art on some question types. We provide short- and long-term solutions to avoid these pitfalls and realize the benefits of OOD evaluation.
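To make the "inverting" trick concrete, below is a minimal Python sketch of the kind of embarrassingly simple baseline the abstract describes: it ignores the image and the question content entirely and guesses answers with probability inversely proportional to their training frequency. The function names (train_answer_priors, inverted_prior_guess) and the toy data are illustrative assumptions, not the paper's actual code.

import random
from collections import Counter, defaultdict

def train_answer_priors(train_examples):
    # Count answer frequencies per question type from (question_type, answer) pairs.
    counts = defaultdict(Counter)
    for qtype, answer in train_examples:
        counts[qtype][answer] += 1
    return counts

def inverted_prior_guess(counts, qtype):
    # Sample an answer with weight inversely proportional to its training
    # frequency. This uses no image or question features; it exploits only
    # the knowledge that VQA-CP inverts label distributions between splits.
    answers = list(counts[qtype])
    weights = [1.0 / counts[qtype][a] for a in answers]
    return random.choices(answers, weights=weights, k=1)[0]

# Hypothetical toy data: 'yes' dominates yes/no questions in training,
# so the inverted prior mostly guesses 'no' at test time.
train = [("yes/no", "yes")] * 80 + [("yes/no", "no")] * 20
priors = train_answer_priors(train)
print(inverted_prior_guess(priors, "yes/no"))  # usually 'no'

Because VQA-CP swaps the per-question-type answer distribution between training and test splits, even a prior-only guesser like this can score well on question types dominated by a few answers, which is precisely the pitfall the paper warns about.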