Paper title
Subspace approximation with outliers
Paper authors
Paper abstract
The subspace approximation problem with outliers, for given $n$ points in $d$ dimensions $x_{1},\ldots, x_{n} \in \mathbb{R}^{d}$, an integer $1 \leq k \leq d$, and an outlier parameter $0 \leq \alpha \leq 1$, is to find a $k$-dimensional linear subspace of $\mathbb{R}^{d}$ that minimizes the sum of squared distances to its nearest $(1-\alpha)n$ points. More generally, the $\ell_{p}$ subspace approximation problem with outliers minimizes the sum of $p$-th powers of distances instead of the sum of squared distances. Even the case of robust PCA is non-trivial, and previous work requires additional assumptions on the input. Any multiplicative approximation algorithm for the subspace approximation problem with outliers must solve the robust subspace recovery problem, a special case in which the $(1-\alpha)n$ inliers in the optimal solution are promised to lie exactly on a $k$-dimensional linear subspace. However, robust subspace recovery is Small Set Expansion (SSE)-hard.

We show how to extend dimension reduction techniques and bi-criteria approximations based on sampling to the problem of subspace approximation with outliers. To get around the SSE-hardness of robust subspace recovery, we assume that the squared distance error of the optimal $k$-dimensional subspace summed over the optimal $(1-\alpha)n$ inliers is at least $\delta$ times its squared error summed over all $n$ points, for some $0 < \delta \leq 1 - \alpha$. With this assumption, we give an efficient algorithm to find a subset of $\mathrm{poly}(k/\epsilon) \log(1/\delta) \log\log(1/\delta)$ points whose span contains a $k$-dimensional subspace that gives a multiplicative $(1+\epsilon)$-approximation to the optimal solution. The running time of our algorithm is linear in $n$ and $d$. Interestingly, our results hold even when the fraction of outliers $\alpha$ is large, as long as the obvious condition $0 < \delta \leq 1 - \alpha$ is satisfied.
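For reference, the objective and the assumption described in the abstract can be written out formally as follows (notation follows the abstract; $d(x, V)$ denotes the Euclidean distance from a point $x$ to a subspace $V$, and $V^{*}$, $S^{*}$ denote an optimal subspace and its inlier set):

$$\min_{\substack{V \subseteq \mathbb{R}^{d} \\ \dim(V) = k}} \; \min_{\substack{S \subseteq \{1,\ldots,n\} \\ |S| = (1-\alpha)n}} \; \sum_{i \in S} d(x_{i}, V)^{2},$$

with the $\ell_{p}$ variant replacing $d(x_{i}, V)^{2}$ by $d(x_{i}, V)^{p}$. The assumption made to circumvent SSE-hardness reads

$$\sum_{i \in S^{*}} d(x_{i}, V^{*})^{2} \;\geq\; \delta \sum_{i=1}^{n} d(x_{i}, V^{*})^{2}, \qquad 0 < \delta \leq 1 - \alpha.$$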
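As an illustration only (this is not the paper's algorithm), here is a minimal numpy sketch of evaluating the outlier-robust cost for a candidate subspace; the function name and interface are hypothetical:

```python
import numpy as np

def outlier_robust_cost(X, V, alpha, p=2):
    """Sum of p-th powers of distances from the (1 - alpha) * n points
    nearest to the subspace spanned by the columns of V.
    X: (n, d) data matrix; V: (d, k) basis for the candidate subspace."""
    # Orthonormalize a basis for the candidate k-dimensional subspace.
    Q, _ = np.linalg.qr(V)                      # Q: (d, k), orthonormal columns
    # Component of each point orthogonal to the subspace.
    residuals = X - (X @ Q) @ Q.T
    dists = np.linalg.norm(residuals, axis=1)   # distance of each point to the subspace
    m = int(np.ceil((1 - alpha) * len(X)))      # number of inliers to keep
    nearest = np.sort(dists)[:m]                # drop the alpha * n farthest points
    return np.sum(nearest ** p)

# Example: 200 points in R^5 with 10% outliers, a random 2-dimensional candidate.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
V = rng.standard_normal((5, 2))
print(outlier_robust_cost(X, V, alpha=0.1))
```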