Paper Title
Does Learning from Decentralized Non-IID Unlabeled Data Benefit from Self Supervision?
Authors
Abstract
Decentralized learning has been advocated and widely deployed to make efficient use of distributed datasets, with an extensive focus on supervised learning (SL) problems. Unfortunately, the majority of real-world data are unlabeled and can be highly heterogeneous across sources. In this work, we carefully study decentralized learning with unlabeled data through the lens of self-supervised learning (SSL), specifically contrastive visual representation learning. We study the effectiveness of a range of contrastive learning algorithms under decentralized learning settings, on relatively large-scale datasets including ImageNet-100, MS-COCO, and a new real-world robotic warehouse dataset. Our experiments show that the decentralized SSL (Dec-SSL) approach is robust to the heterogeneity of decentralized datasets and learns useful representations for object classification, detection, and segmentation tasks. This robustness makes it possible to significantly reduce communication and the participation ratio of data sources with only minimal drops in performance. Interestingly, using the same amount of data, the representations learned by Dec-SSL not only perform on par with those learned by centralized SSL, which requires communication and excessive data storage costs, but also sometimes outperform representations extracted from decentralized SL, which requires extra knowledge about the data labels. Finally, we provide theoretical insights into why data heterogeneity is less of a concern for Dec-SSL objectives, and introduce feature alignment and clustering techniques to develop a new Dec-SSL algorithm that further improves performance in the face of highly non-IID data. Our study presents positive evidence to embrace unlabeled data in decentralized learning, and we hope to provide new insights into whether and why decentralized SSL is effective.