Paper Title
Our Evaluation Metric Needs an Update to Encourage Generalization
Abstract
Models that surpass human performance on several popular benchmarks degrade significantly when exposed to Out-of-Distribution (OOD) data. Recent research has shown that models overfit to spurious biases and 'hack' datasets instead of learning generalizable features the way humans do. To curb this inflation in measured model performance -- and the resulting overestimation of AI systems' capabilities -- we propose a simple and novel evaluation metric, the WOOD Score, that encourages generalization during evaluation.
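The abstract does not spell out how the WOOD Score is computed. As a purely illustrative sketch, one way an evaluation metric can encourage generalization is to weight out-of-distribution accuracy more heavily than in-distribution accuracy, so a model cannot score well by fitting the in-distribution test set alone. The function name `weighted_ood_accuracy` and the `ood_weight` parameter below are hypothetical assumptions, not the paper's definition of the metric.

```python
import numpy as np

def weighted_ood_accuracy(preds_id, labels_id, preds_ood, labels_ood,
                          ood_weight=2.0):
    """Hypothetical metric that up-weights OOD examples (not the paper's
    WOOD Score definition, which the abstract leaves unspecified).

    An ood_weight > 1 penalizes models whose accuracy drops on OOD data,
    rewarding generalization over in-distribution overfitting.
    """
    # Plain accuracy on the in-distribution test set.
    acc_id = np.mean(np.asarray(preds_id) == np.asarray(labels_id))
    # Plain accuracy on the out-of-distribution test set.
    acc_ood = np.mean(np.asarray(preds_ood) == np.asarray(labels_ood))
    # Weighted average: OOD accuracy counts ood_weight times as much.
    return (acc_id + ood_weight * acc_ood) / (1.0 + ood_weight)
```

Under this sketch, a model scoring 0.95 in-distribution but 0.55 OOD would receive (0.95 + 2 * 0.55) / 3 ≈ 0.68, making the generalization gap visible in a single number rather than hidden behind the in-distribution score.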