Paper Title
Generalization Bounds for Stochastic Gradient Descent via Localized $\varepsilon$-Covers
Paper Authors
Paper Abstract
In this paper, we propose a new covering technique localized for the trajectories of SGD. This localization provides an algorithm-specific complexity measured by the covering number, which can have dimension-independent cardinality, in contrast to standard uniform covering arguments that result in exponential dimension dependence. Based on this localized construction, we show that if the objective function is a finite perturbation of a piecewise strongly convex and smooth function with $P$ pieces, i.e., non-convex and non-smooth in general, the generalization error can be upper bounded by $O(\sqrt{(\log n\log(nP))/n})$, where $n$ is the number of data samples. In particular, this rate is independent of dimension and does not require early stopping or a decaying step size. Finally, we employ these results in various contexts and derive generalization bounds for multi-index linear models, multi-class support vector machines, and $K$-means clustering for both hard and soft label setups, improving the known state-of-the-art rates.
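To make the setting concrete, the following is a minimal illustrative sketch (hypothetical, not the paper's code): it runs constant-step SGD on a toy piecewise strongly convex objective, built as a pointwise minimum of $P$ quadratic pieces and hence non-convex and non-smooth overall, and then evaluates the dimension-free rate $\sqrt{(\log n\log(nP))/n}$ stated above. All names (piecewise_loss, piecewise_grad) and parameter values are assumptions chosen only for illustration.

# Hypothetical sketch: SGD on a piecewise strongly convex toy objective
# and the abstract's generalization rate sqrt(log n * log(nP) / n).
import numpy as np

rng = np.random.default_rng(0)
P, d, n = 4, 2, 1000                       # pieces, dimension, sample size (illustrative)
centers = rng.normal(size=(P, d))          # each piece is a strongly convex quadratic

def piecewise_loss(w, x):
    # loss on a single sample x: pointwise minimum over P quadratic pieces
    return min(0.5 * np.sum((w - c - x) ** 2) for c in centers)

def piecewise_grad(w, x):
    # gradient of the active (minimizing) piece
    c = min(centers, key=lambda c: np.sum((w - c - x) ** 2))
    return w - c - x

data = 0.1 * rng.normal(size=(n, d))       # synthetic samples
w = np.zeros(d)
eta = 0.1                                  # constant step size (no decay, per the abstract)
for x in rng.permutation(data):            # one SGD pass over the data
    w -= eta * piecewise_grad(w, x)

emp_risk = np.mean([piecewise_loss(w, x) for x in data])
rate = np.sqrt(np.log(n) * np.log(n * P) / n)   # dimension-free rate from the bound
print(f"empirical risk {emp_risk:.4f}, generalization rate ~ {rate:.4f}")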