Paper Title
Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization
Paper Authors
Paper Abstract
Recent empirical studies on domain generalization (DG) have shown that DG algorithms that perform well on some distribution shifts fail on others, and no state-of-the-art DG algorithm performs consistently well on all shifts. Moreover, real-world data often has multiple distribution shifts over different attributes; hence we introduce multi-attribute distribution shift datasets and find that the accuracy of existing DG algorithms falls even further. To explain these results, we provide a formal characterization of generalization under multi-attribute shifts using a canonical causal graph. Based on the relationship between spurious attributes and the classification label, we obtain realizations of the canonical causal graph that characterize common distribution shifts and show that each shift entails different independence constraints over observed variables. As a result, we prove that any algorithm based on a single, fixed constraint cannot work well across all shifts, providing theoretical evidence for mixed empirical results on DG algorithms. Based on this insight, we develop Causally Adaptive Constraint Minimization (CACM), an algorithm that uses knowledge about the data-generating process to adaptively identify and apply the correct independence constraints for regularization. Results on fully synthetic, MNIST, small NORB, and Waterbirds datasets, covering binary and multi-valued attributes and labels, show that adaptive dataset-dependent constraints lead to the highest accuracy on unseen domains whereas incorrect constraints fail to do so. Our results demonstrate the importance of modeling the causal relationships inherent in the data-generating process.
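The abstract's central claim, that the appropriate independence constraint depends on how a spurious attribute relates to the label, can be illustrated with a small sketch. This is not the authors' CACM implementation: the function names (`mean_gap`, `independence_penalty`, `choose_conditioning`), the two relationship labels, and the mean-difference penalty (a crude stand-in for a distributional distance) are all assumptions made for illustration.

```python
import numpy as np

def mean_gap(features, attr):
    """Sum of squared distances between each attribute group's mean
    feature vector and the overall mean (0 when group means agree)."""
    overall = features.mean(axis=0)
    gap = 0.0
    for a in np.unique(attr):
        group = features[attr == a]
        gap += float(np.sum((group.mean(axis=0) - overall) ** 2))
    return gap

def independence_penalty(features, attr, cond=None):
    """Penalty for 'features independent of attr' (cond=None) or
    'features independent of attr given cond' (cond provided)."""
    if cond is None:
        return mean_gap(features, attr)
    total = 0.0
    for c in np.unique(cond):
        mask = cond == c
        total += mean_gap(features[mask], attr[mask])
    return total

def choose_conditioning(relationship, label):
    """Hypothetical mapping from an assumed attribute-label relationship
    to the conditioning set of the constraint (illustrative only)."""
    if relationship == "independent_of_label":
        return None   # unconditional constraint
    if relationship == "correlated_with_label":
        return label  # condition on the label
    raise ValueError(f"unknown relationship: {relationship}")

# Toy data: the representation depends only on the label y,
# while the spurious attribute a agrees with y on 6 of 8 samples.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
a = np.array([0, 0, 0, 1, 1, 1, 1, 0])
features = y.astype(float).reshape(-1, 1)

# The unconditional constraint penalizes this label-only representation...
wrong = independence_penalty(features, a, choose_conditioning("independent_of_label", y))
# ...whereas the label-conditional constraint is satisfied exactly.
right = independence_penalty(features, a, choose_conditioning("correlated_with_label", y))
print(wrong, right)
```

In this toy setting the representation uses only label information, yet the unconditional penalty is nonzero because the attribute is correlated with the label. A regularizer built on that fixed constraint would push the model away from a representation that is already free of the attribute given the label, which is the kind of failure the abstract attributes to applying an incorrect constraint.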