论文标题
在转型不断增长下的PAC可学习性理论
A Theory of PAC Learnability under Transformation Invariances
论文作者
论文摘要
在许多现实世界中,变换不断存在。例如,图像分类通常是旋转和颜色转换的不变性:以不同颜色的旋转汽车仍被识别为汽车。数据增强将转换的数据添加到培训集中并在增强数据上训练模型,是一种通常使用这些不断增长的技术。但是,目前尚不清楚数据增强在理论上的性能以及在发生变换不变的情况下的最佳算法是什么。在本文中,我们根据不同级别的可实现性研究在三种环境中的转化不变性下的PAC可学习性:(i)假设符合增强数据; (ii)一个假设仅拟合原始数据,而转换的数据则位于数据分布的支持; (iii)不可知案。一个有趣的观察结果是,要在设置(II)和(iii)中区分原始数据和转换数据是必要的,这意味着任何不区分原始数据和转换数据(包括数据增强)的算法并不是最佳的。此外,这种类型的算法甚至可以“伤害”准确性。在设置(i)中,尽管不需要区分两个数据集,但是数据增强仍然不能最佳地执行。由于这种差异,我们提出了两种组合措施,表征了设置(i)和(ii)(iii)中最佳样品复杂性并提供最佳算法。
Transformation invariances are present in many real-world problems. For example, image classification is usually invariant to rotation and color transformation: a rotated car in a different color is still identified as a car. Data augmentation, which adds the transformed data into the training set and trains a model on the augmented data, is one commonly used technique to build these invariances into the learning process. However, it is unclear how data augmentation performs theoretically and what the optimal algorithm is in presence of transformation invariances. In this paper, we study PAC learnability under transformation invariances in three settings according to different levels of realizability: (i) A hypothesis fits the augmented data; (ii) A hypothesis fits only the original data and the transformed data lying in the support of the data distribution; (iii) Agnostic case. One interesting observation is that distinguishing between the original data and the transformed data is necessary to achieve optimal accuracy in setting (ii) and (iii), which implies that any algorithm not differentiating between the original and transformed data (including data augmentation) is not optimal. Furthermore, this type of algorithms can even "harm" the accuracy. In setting (i), although it is unnecessary to distinguish between the two data sets, data augmentation still does not perform optimally. Due to such a difference, we propose two combinatorial measures characterizing the optimal sample complexity in setting (i) and (ii)(iii) and provide the optimal algorithms.