Paper Title
Universality of empirical risk minimization
Paper Authors
Paper Abstract
Consider supervised learning from i.i.d. samples $\{({\boldsymbol x}_i, y_i)\}_{i\le n}$ where ${\boldsymbol x}_i \in \mathbb{R}^p$ are feature vectors and $y_i \in \mathbb{R}$ are labels. We study empirical risk minimization over a class of functions that are parameterized by $\mathsf{k} = O(1)$ vectors ${\boldsymbol \theta}_1, \dots, {\boldsymbol \theta}_{\mathsf{k}} \in \mathbb{R}^p$, and prove universality results both for the training and test error. Namely, under the proportional asymptotics $n, p \to \infty$, with $n/p = \Theta(1)$, we prove that the training error depends on the random features distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed, to leading order, under a simpler model in which the feature vectors ${\boldsymbol x}_i$ are replaced by Gaussian vectors ${\boldsymbol g}_i$ with the same covariance. Earlier universality results were limited to strongly convex learning procedures, or to feature vectors ${\boldsymbol x}_i$ with independent entries. Our results do not make any of these assumptions. Our assumptions are general enough to include feature vectors ${\boldsymbol x}_i$ produced by randomized featurization maps. In particular, we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).
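
To make the Gaussian-equivalence statement concrete, the following is a minimal NumPy sketch (not the paper's code) that compares the training error of ridge-regularized ERM under a ReLU random-features design $x_i = \mathrm{relu}(W z_i)$ with the training error under a Gaussian design $g_i$ of matched mean and covariance. The specific choices here are illustrative assumptions rather than details from the paper: a single parameter vector (i.e. $\mathsf{k} = 1$), squared loss with a ridge penalty, a noisy linear label model, and helper names such as ridge_train_error.

import numpy as np

# Minimal numerical sketch of the universality claim (illustrative assumptions:
# k = 1, squared loss with ridge regularization, noisy linear label model).
rng = np.random.default_rng(0)
n, p, d = 600, 400, 200                      # proportional regime: n/p = Theta(1)
W = rng.standard_normal((p, d)) / np.sqrt(d)

def relu(u):
    return np.maximum(u, 0.0)

# Random-features design: x_i = relu(W z_i)
Z = rng.standard_normal((n, d))
X = relu(Z @ W.T)                            # n x p feature matrix

# Gaussian-equivalent design: same (estimated) mean and covariance
Z_big = rng.standard_normal((20 * n, d))
X_big = relu(Z_big @ W.T)
mu, Sigma = X_big.mean(axis=0), np.cov(X_big, rowvar=False)
G = rng.multivariate_normal(mu, Sigma, size=n)

# Same label model applied to each design: y = <beta0, feature>/sqrt(p) + noise
beta0 = rng.standard_normal(p)
y_X = X @ beta0 / np.sqrt(p) + 0.5 * rng.standard_normal(n)
y_G = G @ beta0 / np.sqrt(p) + 0.5 * rng.standard_normal(n)

def ridge_train_error(A, y, lam=0.1):
    # Training error of the ridge-regularized empirical risk minimizer:
    # theta_hat = argmin_theta ||y - A theta||^2 / n + lam ||theta||^2
    n_samples = A.shape[0]
    theta_hat = np.linalg.solve(A.T @ A / n_samples + lam * np.eye(A.shape[1]),
                                A.T @ y / n_samples)
    return np.mean((y - A @ theta_hat) ** 2)

print("train error, random-features design:   ", ridge_train_error(X, y_X))
print("train error, Gaussian-equivalent design:", ridge_train_error(G, y_G))
# Per the universality result, the two training errors should agree to leading
# order, and the gap should shrink as n and p grow proportionally.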