Paper title
Distributed Learning for Principle Eigenspaces without Moment Constraints
Paper authors
Abstract
Distributed Principal Component Analysis (PCA) has been studied for settings in which data are stored across multiple machines and communication cost or privacy concerns prohibit computing PCA in a central location. However, the sub-Gaussian assumption in the related literature is restrictive in real applications, where outliers or heavy-tailed data are common in areas such as finance and macroeconomics. In this article, we propose a distributed algorithm for estimating the principal eigenspaces without any moment constraint on the underlying distribution. We study the problem under the elliptical family framework and adopt the sample multivariate Kendall's tau matrix to extract an eigenspace estimator on each sub-machine; these estimators can be viewed as points on the Grassmann manifold. We then take the "center" of these points as the final distributed estimator of the principal eigenspace. We investigate the bias and variance of the distributed estimator and derive its convergence rate, which depends on the effective rank and eigengap of the scatter matrix and on the number of sub-machines. We show that the distributed estimator performs as if we had full access to the whole data. Simulation studies show that the distributed algorithm performs comparably with the existing one for light-tailed data, while showing a great advantage for heavy-tailed data. We also extend our algorithm to the distributed learning of elliptical factor models and verify its empirical usefulness through a real application to a macroeconomic dataset.
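The procedure outlined in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the sample multivariate Kendall's tau matrix takes its usual form, the average of (X_i - X_j)(X_i - X_j)^T / ||X_i - X_j||^2 over all pairs i < j, and it assumes the "center" of the local eigenspaces is obtained by averaging their projection matrices and taking the leading eigenvectors of that average. The abstract does not specify either detail. All function names and the simulated multivariate Cauchy data are hypothetical choices made for illustration.

```python
import numpy as np

def kendall_tau_matrix(X):
    """Sample multivariate Kendall's tau matrix of the rows of X (n x p)."""
    n, p = X.shape
    K = np.zeros((p, p))
    for i in range(n - 1):
        D = X[i] - X[i + 1:]                  # differences to all later rows
        norms = np.linalg.norm(D, axis=1)
        keep = norms > 0                      # skip identical pairs
        D = D[keep] / norms[keep, None]       # unit-length pairwise differences
        K += D.T @ D
    return K / (n * (n - 1) / 2)              # average over all pairs

def top_eigenspace(S, k):
    """Orthonormal basis (p x k) of the leading eigenspace of symmetric S."""
    _, vecs = np.linalg.eigh(S)               # eigenvalues in ascending order
    return vecs[:, -k:]

def distributed_eigenspace(blocks, k):
    """Combine local eigenspace estimates from several machines.

    Assumption: the "center" is taken to be the leading eigenspace of the
    averaged projection matrices; the abstract does not fix this choice.
    """
    p = blocks[0].shape[1]
    avg_proj = np.zeros((p, p))
    for X in blocks:
        V = top_eigenspace(kendall_tau_matrix(X), k)   # computed locally
        avg_proj += V @ V.T                            # only p x p summaries are shared
    avg_proj /= len(blocks)
    return top_eigenspace(avg_proj, k)

# Illustrative data: multivariate Cauchy (elliptical, no finite moments).
rng = np.random.default_rng(0)
m, n, p, k = 10, 200, 6, 2                    # machines, samples per machine, dimension, rank
scales = np.array([5.0, 4.0, 1.0, 1.0, 1.0, 1.0])     # square roots of scatter eigenvalues
blocks = []
for _ in range(m):
    Z = rng.standard_normal((n, p)) * scales
    w = np.abs(rng.standard_normal((n, 1)))   # sqrt of a chi-square(1) variable
    blocks.append(Z / w)

V_hat = distributed_eigenspace(blocks, k)
U = np.eye(p)[:, :k]                          # true leading eigenspace of the scatter matrix
err = np.linalg.norm(V_hat @ V_hat.T - U @ U.T)
print("projection distance to the truth:", err)
```

Each machine only needs to send a p x k basis (or the corresponding p x p projection) to the aggregator, so raw observations never leave the machine on which they are stored. Because each pairwise difference is normalized to unit length, extreme observations have bounded influence on the local estimate, which is why the sketch runs on data with no finite mean or variance.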