Title

Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank Approximation

Authors

Murad Tukan, Alaa Maalouf, Matan Weksler, Dan Feldman

Abstract

A common technique for compressing a neural network is to compute the $k$-rank $\ell_2$ approximation $A_{k,2}$ of the matrix $A\in\mathbb{R}^{n\times d}$ that corresponds to a fully connected layer (or embedding layer). Here, $d$ is the number of neurons in the layer, $n$ is the number of neurons in the next one, and $A_{k,2}$ can be stored in $O((n+d)k)$ memory instead of $O(nd)$. This $\ell_2$-approximation minimizes the sum over every entry to the power of $p=2$ in the matrix $A - A_{k,2}$, among all matrices $A_{k,2}\in\mathbb{R}^{n\times d}$ whose rank is $k$. While it can be computed efficiently via SVD, the $\ell_2$-approximation is known to be very sensitive to outliers ("far-away" rows). Hence, machine learning uses the $\ell_1$-norm in, e.g., Lasso regression, $\ell_1$-regularization, and $\ell_1$-SVM. This paper suggests replacing the $k$-rank $\ell_2$ approximation with an $\ell_p$ approximation, for $p\in [1,2]$. We then provide practical and provable approximation algorithms to compute it for any $p\geq 1$, based on modern techniques in computational geometry. Extensive experimental results on the GLUE benchmark for compressing BERT, DistilBERT, XLNet, and RoBERTa confirm this theoretical advantage. For example, our approach achieves $28\%$ compression of RoBERTa's embedding layer with only a $0.63\%$ additive drop in accuracy (without fine-tuning) on average over all tasks in GLUE, compared to an $11\%$ drop using the existing $\ell_2$-approximation. Open code is provided for reproducing and extending our results.
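As a rough illustration of the low-rank compression described in the abstract, the minimal sketch below (Python/NumPy, with hypothetical function names) factors a layer's weight matrix $A$ into two factors of total size $(n+d)k$ instead of $nd$. The SVD call gives the exact $\ell_2$ (Frobenius) solution; the iteratively reweighted loop is only a generic stand-in for a row-wise $\ell_p$ objective with $p\in[1,2]$ and is not the paper's provable computational-geometry algorithm.

```python
# Minimal sketch (not the paper's algorithm): compress an n x d weight matrix A
# into a rank-k factorization L @ R, stored in O((n+d)k) memory instead of O(nd).
import numpy as np

def rank_k_l2(A, k):
    """Best rank-k approximation under the l2 (Frobenius) norm via SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Factors of shape (n, k) and (k, d): (n + d) * k numbers instead of n * d.
    return U[:, :k] * s[:k], Vt[:k, :]

def rank_k_lp_irls(A, k, p=1.0, iters=20, eps=1e-8):
    """Heuristic rank-k l_p approximation via iteratively reweighted SVD.

    Rows with large residuals receive weight ||r_i||^(p-2), which is small for
    p < 2, so "far-away" (outlier) rows influence the fit less than in plain SVD.
    """
    n, _ = A.shape
    w = np.ones(n)
    for _ in range(iters):
        # Weighted l2 subproblem: SVD of the row-scaled matrix.
        _, _, Vt = np.linalg.svd(np.sqrt(w)[:, None] * A, full_matrices=False)
        V = Vt[:k, :]                      # current rank-k row space
        L = A @ V.T                        # project the original rows onto it
        resid = np.linalg.norm(A - L @ V, axis=1)
        w = (resid + eps) ** (p - 2)       # large residual -> small weight
    return L, V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((768, 512))
    A[:5] *= 50.0                          # a few "far-away" (outlier) rows
    for name, (L, R) in {"l2 (SVD)": rank_k_l2(A, 64),
                         "l1 (IRLS sketch)": rank_k_lp_irls(A, 64, p=1.0)}.items():
        err = np.sum(np.linalg.norm(A - L @ R, axis=1))  # sum of row-wise residual norms
        print(f"{name}: sum of row-wise residual norms = {err:.1f}")
```

On data with a few outlier rows, the reweighted factorization typically spends less of its rank budget fitting those rows, which mirrors the robustness argument the abstract makes for $\ell_p$ with $p\in[1,2]$.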
