Paper Title
A Data-Based Perspective on Transfer Learning
Paper Authors
Paper Abstract
It is commonly believed that, in transfer learning, including more pre-training data translates into better performance. However, recent evidence suggests that removing data from the source dataset can actually help too. In this work, we take a closer look at the role of the source dataset's composition in transfer learning and present a framework for probing its impact on downstream performance. Our framework gives rise to new capabilities, such as pinpointing transfer learning brittleness, as well as detecting pathologies such as data leakage and the presence of misleading examples in the source dataset. In particular, we demonstrate that removing detrimental datapoints identified by our framework improves transfer learning performance from ImageNet on a variety of target tasks. Code is available at https://github.com/MadryLab/data-transfer.
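The abstract describes the framework only at a high level. Below is a minimal, hypothetical sketch of the counterfactual-subset idea it alludes to: pre-train many models on random subsets of the source classes, transfer each to the target task, and score each source class by comparing downstream accuracy between subsets that include it and subsets that exclude it. The function name `estimate_source_influence` and both input arrays are illustrative assumptions, not the authors' actual API; the real implementation is in the linked repository.

```python
import numpy as np

def estimate_source_influence(subset_masks, target_accs):
    """Score each source class's effect on downstream performance.

    subset_masks: (n_models, n_classes) boolean array; subset_masks[i, c]
        is True if source class c was included when pre-training model i.
    target_accs: (n_models,) downstream accuracy of each pre-trained
        model after transfer to the target task.

    Returns an (n_classes,) array of influence scores: positive values
    suggest a class helps transfer; negative values suggest it hurts.
    """
    masks = np.asarray(subset_masks, dtype=bool)
    accs = np.asarray(target_accs, dtype=float)
    influence = np.empty(masks.shape[1])
    for c in range(masks.shape[1]):
        included = masks[:, c]
        # Difference in mean downstream accuracy with vs. without class c.
        # (Assumes each class is both included and excluded at least once.)
        influence[c] = accs[included].mean() - accs[~included].mean()
    return influence

# Hypothetical usage with stand-in data: rank source classes and flag
# the most detrimental ones as candidates for removal before re-running
# pre-training and transfer.
rng = np.random.default_rng(0)
masks = rng.random((500, 1000)) < 0.5   # 500 subset models, 1000 classes
accs = rng.random(500)                  # placeholder downstream accuracies
scores = estimate_source_influence(masks, accs)
detrimental = np.argsort(scores)[:50]   # lowest-scoring classes
```

Under these assumptions, the sketch trades compute (many pre-training runs on class subsets) for a per-class estimate of transfer value; the detrimental classes it surfaces are the kind of datapoints the abstract reports removing to improve transfer from ImageNet.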