Paper Title

Data Sampling Affects the Complexity of Online SGD over Dependent Data

Authors

Shaocong Ma, Ziyi Chen, Yi Zhou, Kaiyi Ji, Yingbin Liang

Abstract

Conventional machine learning applications typically assume that data samples are independently and identically distributed (i.i.d.). However, practical scenarios often involve a data-generating process that produces highly dependent data samples, which are known to heavily bias the stochastic optimization process and slow down the convergence of learning. In this paper, we conduct a fundamental study on how different stochastic data sampling schemes affect the sample complexity of online stochastic gradient descent (SGD) over highly dependent data. Specifically, with a $ϕ$-mixing model of data dependence, we show that online SGD with proper periodic data-subsampling achieves an improved sample complexity over the standard online SGD in the full spectrum of the data dependence level. Interestingly, even subsampling a subset of data samples can accelerate the convergence of online SGD over highly dependent data. Moreover, we show that online SGD with mini-batch sampling can further substantially improve the sample complexity over online SGD with periodic data-subsampling over highly dependent data. Numerical experiments validate our theoretical results.
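To make the two sampling schemes in the abstract concrete, here is a minimal Python sketch (not the authors' code) comparing standard online SGD, online SGD with periodic data-subsampling, and online SGD with mini-batch sampling on a synthetic dependent stream. The AR(1) data model, the mean-estimation loss, and all function names and step sizes are illustrative assumptions standing in for the paper's $ϕ$-mixing setting.

```python
# Illustrative sketch: effect of sampling schemes on online SGD over
# dependent data. The AR(1) stream, loss, and hyperparameters are
# assumptions for demonstration, not the paper's exact setup.
import numpy as np

def make_ar1_stream(n, rho=0.9, seed=0):
    """Generate a dependent scalar stream via an AR(1) process.
    Larger rho means stronger dependence between consecutive samples."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()
    return x

def online_sgd_subsampled(stream, lr=0.1, period=1):
    """Online SGD for the mean-estimation loss f(w) = E[(w - x)^2] / 2,
    updating only on every `period`-th sample (period=1 is standard
    online SGD). Skipping samples weakens the dependence between
    consecutive stochastic gradients."""
    w = 0.0
    for t, x in enumerate(stream):
        if t % period == 0:
            w -= lr * (w - x)  # stochastic gradient of the quadratic loss
    return w

def online_sgd_minibatch(stream, lr=0.1, batch_size=10):
    """Online SGD with mini-batch sampling: average the gradients over
    consecutive blocks, which averages out the data dependence within
    each block."""
    w = 0.0
    for start in range(0, len(stream) - batch_size + 1, batch_size):
        batch = stream[start:start + batch_size]
        w -= lr * np.mean(w - batch)  # block-averaged gradient
    return w

if __name__ == "__main__":
    data = make_ar1_stream(20_000, rho=0.95)  # highly dependent stream
    print("standard online SGD:  ", online_sgd_subsampled(data, period=1))
    print("periodic subsampling: ", online_sgd_subsampled(data, period=10))
    print("mini-batch sampling:  ", online_sgd_minibatch(data, batch_size=10))
```

All three estimators converge to the stream's mean; the point of the sketch is that subsampling and mini-batching reduce the correlation between successive updates, which is the mechanism behind the improved sample complexity shown in the paper.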
