Title
Mixed Evidence for Gestalt Grouping in Deep Neural Networks
Authors
Abstract
Gestalt psychologists have identified a range of conditions under which humans organize elements of a scene into a group or whole, and perceptual grouping principles play an essential role in scene perception and object identification. Recently, Deep Neural Networks (DNNs) trained on natural images (ImageNet) have been proposed as compelling models of human vision, based on reports that they perform well on various brain and behavioral benchmarks. Here we test a total of 16 networks covering a variety of architectures and learning paradigms (convolutional, attention-based, supervised and self-supervised, feed-forward and recurrent) on dot stimuli (Experiment 1) and more complex shape stimuli (Experiment 2) that produce strong Gestalt effects in humans. In Experiment 1, we found that convolutional networks were indeed sensitive, in a human-like fashion, to the principles of proximity, linearity, and orientation, but only at the output layer. In Experiment 2, we found that most networks exhibited Gestalt effects for only a few sets, and again only at the latest stage of processing. Overall, self-supervised networks and Vision Transformers appeared to perform worse than convolutional networks in terms of human similarity. Remarkably, no model showed a grouping effect at the early or intermediate stages of processing. This is at odds with the widespread assumption that Gestalts occur prior to object recognition and, indeed, serve to organize the visual scene for the sake of object recognition. Our overall conclusion is that, although it is noteworthy that networks trained on simple 2D images support a form of Gestalt grouping for some stimuli at the output layer, this ability does not seem to transfer to more complex features. Additionally, the fact that this grouping occurs only at the last layer suggests that networks learn perceptual properties fundamentally different from those of humans.