Paper Title

Data Sampling Affects the Complexity of Online SGD over Dependent Data

Authors

Shaocong Ma, Ziyi Chen, Yi Zhou, Kaiyi Ji, Yingbin Liang

Abstract

Conventional machine learning applications typically assume that data samples are independently and identically distributed (i.i.d.). However, practical scenarios often involve a data-generating process that produces highly dependent data samples, which are known to heavily bias the stochastic optimization process and slow down the convergence of learning. In this paper, we conduct a fundamental study on how different stochastic data sampling schemes affect the sample complexity of online stochastic gradient descent (SGD) over highly dependent data. Specifically, with a $ϕ$-mixing model of data dependence, we show that online SGD with proper periodic data-subsampling achieves an improved sample complexity over the standard online SGD in the full spectrum of the data dependence level. Interestingly, even subsampling a subset of data samples can accelerate the convergence of online SGD over highly dependent data. Moreover, we show that online SGD with mini-batch sampling can further substantially improve the sample complexity over online SGD with periodic data-subsampling over highly dependent data. Numerical experiments validate our theoretical results.
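To make the two sampling schemes in the abstract concrete, here is a minimal Python sketch (not the authors' code) comparing standard online SGD, online SGD with periodic data-subsampling, and online SGD with mini-batch sampling on a synthetic dependent stream. The AR(1) data model, the mean-estimation loss, and all function names and step sizes are illustrative assumptions standing in for the paper's $ϕ$-mixing setting.

```python
# Illustrative sketch: effect of sampling schemes on online SGD over
# dependent data. The AR(1) stream, loss, and hyperparameters are
# assumptions for demonstration, not the paper's exact setup.
import numpy as np

def make_ar1_stream(n, rho=0.9, seed=0):
    """Generate a dependent scalar stream via an AR(1) process.
    Larger rho means stronger dependence between consecutive samples."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()
    return x

def online_sgd_subsampled(stream, lr=0.1, period=1):
    """Online SGD for the mean-estimation loss f(w) = E[(w - x)^2] / 2,
    updating only on every `period`-th sample (period=1 is standard
    online SGD). Skipping samples weakens the dependence between
    consecutive stochastic gradients."""
    w = 0.0
    for t, x in enumerate(stream):
        if t % period == 0:
            w -= lr * (w - x)  # stochastic gradient of the quadratic loss
    return w

def online_sgd_minibatch(stream, lr=0.1, batch_size=10):
    """Online SGD with mini-batch sampling: average the gradients over
    consecutive blocks, which averages out the data dependence within
    each block."""
    w = 0.0
    for start in range(0, len(stream) - batch_size + 1, batch_size):
        batch = stream[start:start + batch_size]
        w -= lr * np.mean(w - batch)  # block-averaged gradient
    return w

if __name__ == "__main__":
    data = make_ar1_stream(20_000, rho=0.95)  # highly dependent stream
    print("standard online SGD:  ", online_sgd_subsampled(data, period=1))
    print("periodic subsampling: ", online_sgd_subsampled(data, period=10))
    print("mini-batch sampling:  ", online_sgd_minibatch(data, batch_size=10))
```

All three estimators converge to the stream's mean; the point of the sketch is that subsampling and mini-batching reduce the correlation between successive updates, which is the mechanism behind the improved sample complexity shown in the paper.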
