Paper Title
Deep Learning Meets Projective Clustering
Paper Authors
Paper Abstract
A common approach for compressing NLP networks is to encode the embedding layer as a matrix $A\in\mathbb{R}^{n\times d}$, compute its rank-$j$ approximation $A_j$ via SVD, and then factor $A_j$ into a pair of matrices that correspond to smaller fully-connected layers to replace the original embedding layer. Geometrically, the rows of $A$ represent points in $\mathbb{R}^d$, and the rows of $A_j$ represent their projections onto the $j$-dimensional subspace that minimizes the sum of squared distances ("errors") to the points. In practice, these rows of $A$ may be spread around $k>1$ subspaces, so factoring $A$ based on a single subspace may lead to large errors that turn into large drops in accuracy. Inspired by \emph{projective clustering} from computational geometry, we suggest replacing this subspace by a set of $k$ subspaces, each of dimension $j$, that minimizes the sum of squared distances over every point (row in $A$) to its \emph{closest} subspace. Based on this approach, we provide a novel architecture that replaces the original embedding layer by a set of $k$ small layers that operate in parallel and are then recombined with a single fully-connected layer. Extensive experimental results on the GLUE benchmark yield networks that are both more accurate and smaller compared to the standard matrix factorization (SVD). For example, we further compress DistilBERT by reducing the size of the embedding layer by $40\%$ while incurring only a $0.5\%$ average drop in accuracy over all nine GLUE tasks, compared to a $2.8\%$ drop using the existing SVD approach. On RoBERTa we achieve $43\%$ compression of the embedding layer with less than a $0.8\%$ average drop in accuracy as compared to a $3\%$ drop previously. Open code for reproducing and extending our results is provided.
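Below is a minimal PyTorch sketch contrasting the two factorizations described in the abstract: the standard rank-$j$ SVD baseline and a simplified $k$-subspace variant in which rows of $A$ are first grouped and each group receives its own rank-$j$ basis. It is illustrative only, not the authors' released implementation: the row grouping uses plain k-means as a stand-in for a true projective-clustering solver, the recombination step is a per-cluster basis multiplication rather than the paper's single fully-connected recombination layer, and all names (`svd_factorize`, `KSubspaceEmbedding`, `n`, `d`, `j`, `k`) are hypothetical.

```python
# Illustrative sketch (not the paper's exact architecture or code).
import torch
import torch.nn as nn


def svd_factorize(A: torch.Tensor, j: int):
    """Rank-j factorization A ~= E @ W with E: (n, j) and W: (j, d)."""
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    E = U[:, :j] * S[:j]   # per-row coefficients, shape (n, j)
    W = Vh[:j, :]          # shared projection back to d dims, shape (j, d)
    return E, W


class KSubspaceEmbedding(nn.Module):
    """Each token stores j coefficients plus a cluster id; the forward pass
    multiplies the coefficients by the basis of that token's own subspace.
    k-means on the rows is used here as a stand-in for projective clustering,
    and each cluster is assumed to contain at least j rows."""

    def __init__(self, A: torch.Tensor, k: int, j: int, iters: int = 10):
        super().__init__()
        n, d = A.shape
        centers = A[torch.randperm(n)[:k]].clone()
        for _ in range(iters):
            assign = torch.cdist(A, centers).argmin(dim=1)
            for c in range(k):
                mask = assign == c
                if mask.any():
                    centers[c] = A[mask].mean(dim=0)
        coeffs = torch.zeros(n, j)
        bases = torch.zeros(k, j, d)
        for c in range(k):
            mask = assign == c
            if mask.any():
                E_c, W_c = svd_factorize(A[mask], j)
                coeffs[mask] = E_c
                bases[c] = W_c
        self.coeffs = nn.Embedding.from_pretrained(coeffs, freeze=False)
        self.register_buffer("assign", assign)
        self.bases = nn.Parameter(bases)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        c = self.coeffs(token_ids)                  # (..., j)
        basis = self.bases[self.assign[token_ids]]  # (..., j, d)
        return torch.einsum("...j,...jd->...d", c, basis)


if __name__ == "__main__":
    n, d, j, k = 1000, 64, 8, 4   # toy sizes, chosen arbitrarily
    A = torch.randn(n, d)
    E, W = svd_factorize(A, j)
    print("SVD baseline error:", torch.norm(A - E @ W).item())
    emb = KSubspaceEmbedding(A, k=k, j=j)
    print("k-subspace error:  ", torch.norm(A - emb(torch.arange(n))).item())
```

In both cases the memory savings come from the same source: the original $n\times d$ lookup table is replaced by an $n\times j$ coefficient table plus one or more small $j\times d$ bases, and using $k$ bases instead of one lets rows that lie near different subspaces be approximated with lower error at a comparable parameter budget.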