Paper Title

Geometric Dataset Distances via Optimal Transport

Paper Authors

David Alvarez-Melis, Nicolò Fusi

Paper Abstract

The notion of task similarity is at the core of various machine learning paradigms, such as domain adaptation and meta-learning. Current methods to quantify it are often heuristic, make strong assumptions on the label sets across the tasks, and many are architecture-dependent, relying on task-specific optimal parameters (e.g., require training a model on each dataset). In this work we propose an alternative notion of distance between datasets that (i) is model-agnostic, (ii) does not involve training, (iii) can compare datasets even if their label sets are completely disjoint and (iv) has solid theoretical footing. This distance relies on optimal transport, which provides it with rich geometry awareness, interpretable correspondences and well-understood properties. Our results show that this novel distance provides meaningful comparison of datasets, and correlates well with transfer learning hardness across various experimental settings and datasets.
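
As a rough illustration of the optimal-transport idea the abstract refers to, the sketch below computes an exact transport cost between two tiny point sets. It is not the authors' method: it ignores labels entirely, assumes both sets have the same size and uniform weights, and uses a squared Euclidean ground cost chosen only for this example. The function name `ot_cost_uniform` is hypothetical. Under these assumptions an optimal plan is a one-to-one matching, so a brute-force search over permutations is exact, though only practical for a handful of points.

```python
from itertools import permutations
from math import dist


def ot_cost_uniform(xs, ys):
    """Optimal transport cost between two equal-size point sets.

    Assumptions (illustrative only): uniform weights 1/n on every point,
    squared Euclidean ground cost, and no use of labels.
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("this sketch assumes non-empty, equally sized point sets")

    # With uniform weights and equal sizes, some optimal plan is a permutation,
    # so trying every matching gives the exact optimum (factorial time).
    best_total = min(
        sum(dist(xs[i], ys[j]) ** 2 for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best_total / n


# Two toy "datasets" of 2-D feature vectors (made-up values).
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.0, 1.0), (1.0, 1.0)]

print(ot_cost_uniform(a, a))  # identical sets cost nothing to transport
print(ot_cost_uniform(a, b))  # each point moves one unit along the y-axis
```

The matching found by the minimisation is what makes transport-based distances interpretable: it pairs each point in one set with a point in the other. For realistic dataset sizes a dedicated solver would be needed instead of enumeration.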