Paper Title

What causes the test error? Going beyond bias-variance via ANOVA

Paper Authors

Licong Lin, Edgar Dobriban

Paper Abstract

Modern machine learning methods are often overparametrized, allowing adaptation to the data at a fine level. This can seem puzzling; in the worst case, such models do not need to generalize. This puzzle inspired a great amount of work examining when overparametrization reduces the test error, a phenomenon known as "double descent". Recent work aimed to understand in greater depth why overparametrization is helpful for generalization. This led to the discovery of the unimodality of the variance as a function of the level of parametrization, and to decomposing the variance into components arising from label noise, initialization, and randomness in the training data, in order to understand the sources of the error. In this work we develop a deeper understanding of this area. Specifically, we propose using the analysis of variance (ANOVA) to decompose the variance in the test error in a symmetric way, for studying the generalization performance of certain two-layer linear and non-linear networks. The advantage of the analysis of variance is that it reveals the effects of initialization, label noise, and training data more clearly than prior approaches. Moreover, we study the monotonicity and unimodality of the variance components. While prior work studied the unimodality of the overall variance, we study the properties of each term in the variance decomposition. One key insight is that in typical settings, the interaction between training samples and initialization can dominate the variance; surprisingly, it can be larger than their marginal effects. Also, we characterize "phase transitions" where the variance changes from unimodal to monotone. On a technical level, we leverage advanced deterministic equivalent techniques for Haar random matrices that -- to our knowledge -- have not yet been used in this area. We also verify our results in numerical simulations and on empirical data examples.
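
To make the decomposition the abstract refers to concrete, here is a brief sketch in our own notation (the paper's symbols may differ). Treat the test error of the trained network as a function of three independent sources of randomness: the initialization I, the training samples S, and the label noise L. The classical ANOVA (Hoeffding) decomposition then splits the variance symmetrically into main effects and interactions:

\[
\operatorname{Var}(\mathrm{Err})
  = \Sigma_I + \Sigma_S + \Sigma_L
  + \Sigma_{IS} + \Sigma_{IL} + \Sigma_{SL}
  + \Sigma_{ISL},
\]

where, e.g., \Sigma_I = \operatorname{Var}(\mathbb{E}[\mathrm{Err} \mid I]) is a main effect and \Sigma_{IS} = \operatorname{Var}(\mathbb{E}[\mathrm{Err} \mid I, S]) - \Sigma_I - \Sigma_S is a two-way interaction. In this notation, the key insight above says that in typical settings \Sigma_{IS} can exceed both \Sigma_I and \Sigma_S, something an asymmetric decomposition that conditions on the noise sources in a fixed order would obscure.

The components can also be estimated numerically. Below is a minimal Monte Carlo sketch, assuming a random-features ridge model as a simple stand-in for the paper's two-layer networks; all sizes, seeds, and names are illustrative and are not taken from the paper's code.

import numpy as np

# Plug-in ANOVA estimates of the variance components of the test error,
# using a random-features ridge model as an illustrative stand-in.
n, d, p = 100, 30, 60          # train size, input dim, hidden width
sigma = 0.5                    # label-noise standard deviation
lam = 1e-3                     # ridge penalty
rng0 = np.random.default_rng(0)
beta = rng0.standard_normal(d) / np.sqrt(d)   # fixed true signal
X_test = rng0.standard_normal((2000, d))      # fixed test inputs
y_test = X_test @ beta                        # noiseless test targets

def test_error(sI, sS, sL):
    """Test MSE as a function of three seeds:
    sI = initialization (random first layer),
    sS = training inputs, sL = label noise."""
    W = np.random.default_rng(sI).standard_normal((d, p)) / np.sqrt(d)
    X = np.random.default_rng(sS).standard_normal((n, d))
    eps = np.random.default_rng(sL).standard_normal(n)
    y = X @ beta + sigma * eps
    F = np.maximum(X @ W, 0.0)                 # ReLU random features
    a = np.linalg.solve(F.T @ F + lam * np.eye(p), F.T @ y)
    pred = np.maximum(X_test @ W, 0.0) @ a
    return np.mean((pred - y_test) ** 2)

# Evaluate the error on a full K x K x K grid of independent seeds.
K = 8
E = np.array([[[test_error(i, 100 + j, 200 + k)
                for k in range(K)] for j in range(K)] for i in range(K)])

# Main effects: variance of the mean taken over the other two axes.
Sig_I = E.mean(axis=(1, 2)).var()
Sig_S = E.mean(axis=(0, 2)).var()
Sig_L = E.mean(axis=(0, 1)).var()
# Two-way interactions: conditional-mean variance minus main effects.
Sig_IS = E.mean(axis=2).var() - Sig_I - Sig_S
Sig_IL = E.mean(axis=1).var() - Sig_I - Sig_L
Sig_SL = E.mean(axis=0).var() - Sig_S - Sig_L
# Three-way term: whatever remains of the total variance.
Sig_ISL = E.var() - (Sig_I + Sig_S + Sig_L + Sig_IS + Sig_IL + Sig_SL)

print(f"total variance: {E.var():.4g}")
for name, v in [("I", Sig_I), ("S", Sig_S), ("L", Sig_L),
                ("IS", Sig_IS), ("IL", Sig_IL), ("SL", Sig_SL),
                ("ISL", Sig_ISL)]:
    print(f"Sigma_{name}: {v:.4g}")

These plug-in estimates are biased for small K but suffice for illustration; comparing Sigma_IS against Sigma_I and Sigma_S in such a simulation is one way to see the sample-initialization interaction the abstract highlights.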
