Paper Title
Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think
Paper Authors
Paper Abstract
We perform an empirical study of the behaviour of deep networks when some of their feature channels are fully linearized through a sparsity prior on the overall number of nonlinear units in the network. In experiments on image classification and machine translation tasks, we investigate how far we can simplify the network function towards linearity before performance collapses. First, we observe a significant performance gap when reducing nonlinearity in the network function early on as opposed to late in training, in line with recent observations on the time evolution of the data-dependent neural tangent kernel (NTK). Second, we find that after training we are able to linearize a significant number of nonlinear units while maintaining high performance, indicating that much of a network's expressivity remains unused but helps gradient descent in the early stages of training. To characterize the depth of the resulting partially linearized network, we introduce a measure called average path length, the average number of active nonlinearities encountered along a path in the network graph. Under sparsity pressure, we find that the remaining nonlinear units organize into distinct structures, forming core networks of near-constant effective depth and width, which in turn depend on task difficulty.
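The following is a minimal sketch (not the authors' released code) of what channel-wise linearization under a sparsity prior can look like: each feature channel gets a gate alpha in [0, 1] that blends a nonlinear path (ReLU) with a purely linear one (identity), and an L1 penalty on the gates plays the role of the sparsity pressure on the number of nonlinear units. The module name GatedReLU, the gate parameterization, and the penalty weight lam are all illustrative assumptions.

    # Sketch: per-channel blend of ReLU and identity, with an L1 sparsity
    # penalty pushing gates toward 0 (i.e. toward fully linear channels).
    import torch
    import torch.nn as nn

    class GatedReLU(nn.Module):
        """Computes alpha * relu(x) + (1 - alpha) * x per feature channel."""
        def __init__(self, num_channels: int):
            super().__init__()
            # One gate per channel, initialized fully nonlinear (alpha = 1).
            self.alpha = nn.Parameter(torch.ones(num_channels))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Broadcast gates over batch and spatial dimensions.
            a = self.alpha.clamp(0.0, 1.0).view(1, -1, *([1] * (x.dim() - 2)))
            return a * torch.relu(x) + (1.0 - a) * x

    def sparsity_penalty(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
        """L1 pressure on the gates; added to the task loss during training."""
        total = sum(m.alpha.clamp(0.0, 1.0).abs().sum()
                    for m in model.modules() if isinstance(m, GatedReLU))
        return lam * total

In this setup, any channel whose gate reaches zero is exactly linear, so it can in principle be folded into the adjacent linear layers after training.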
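For the average-path-length measure, here is one possible reading as code, under the simplifying assumption of a layered feed-forward graph in which a path visits exactly one channel per layer: for uniformly sampled paths, each layer contributes its fraction of still-nonlinear channels, so the expected count of active nonlinearities is the sum of those fractions. The activity threshold eps is an assumption, as is the use of the hypothetical GatedReLU gates from the sketch above.

    # Sketch: expected number of active nonlinearities along a random path,
    # computed as the sum over layers of each layer's active-channel fraction.
    def average_path_length(model: nn.Module, eps: float = 1e-3) -> float:
        length = 0.0
        for m in model.modules():
            if isinstance(m, GatedReLU):
                # A channel counts as nonlinear if its gate exceeds eps.
                active = (m.alpha.clamp(0.0, 1.0) > eps).float().mean().item()
                length += active
        return length

Under this reading, a fully nonlinear stack of depth D has average path length D, and sparsity pressure shrinks the measure toward the effective depth of the remaining core network.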