Language model compression with weighted low-rank factorization

Paper title

Language model compression with weighted low-rank factorization

Authors

Hsu, Yen-Chang; Hua, Ting; Chang, Sungen; Lou, Qian; Shen, Yilin; Jin, Hongxia

Abstract

Factorizing a large matrix into small matrices is a popular strategy for model compression. Singular value decomposition (SVD) plays a vital role in this strategy, approximating a learned matrix with fewer parameters. However, SVD minimizes the squared error of reconstructing the original matrix without gauging the importance of the parameters, potentially giving a larger reconstruction error to the parameters that affect task accuracy more. In other words, the optimization objective of SVD is not aligned with the trained model's task accuracy. We analyze this previously unexplored problem, make observations, and address it by introducing Fisher information to weigh the importance of the parameters affecting the model's prediction. This idea leads to our method: Fisher-Weighted SVD (FWSVD). Although the factorized matrices from our approach do not result in smaller reconstruction errors, we find that the resulting task accuracy is much closer to the original model's performance. We perform analysis with transformer-based language models, showing that our weighted SVD largely alleviates the mismatched optimization objectives and can maintain model performance at a higher compression rate. Our method can directly compress a task-specific model while achieving better performance than other compact-model strategies that require expensive model pre-training. Moreover, an evaluation on compressing an already compact model shows that our method can remove a further 9% to 30% of the parameters with an insignificant impact on task accuracy.
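
The abstract does not spell out how the importance weighting enters the factorization, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes one importance score per row of the weight matrix, which lets an ordinary truncated SVD be reused on a row-scaled matrix. The importance values are random stand-ins rather than real Fisher information, and the names `truncated_svd`, `row_weighted_svd`, and `weighted_error` are hypothetical helpers, not the authors' code.

```python
import numpy as np


def truncated_svd(W, rank):
    """Plain rank-`rank` factorization W ~= A @ B via SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank, :]


def row_weighted_svd(W, row_importance, rank, eps=1e-8):
    """Factorize W so that rows with high importance are reconstructed better.

    Assumption: importance is given per row. Scaling each row by the square
    root of its importance, running SVD, and undoing the scaling minimizes
    the importance-weighted squared reconstruction error.
    """
    d = np.sqrt(row_importance + eps)            # per-row scale, kept positive
    A_scaled, B = truncated_svd(d[:, None] * W, rank)
    A = A_scaled / d[:, None]                    # undo the row scaling
    return A, B


def weighted_error(W, W_hat, row_importance):
    """Squared reconstruction error with each row weighted by its importance."""
    return float(np.sum(row_importance[:, None] * (W - W_hat) ** 2))


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))                        # stand-in for a learned weight matrix
row_importance = rng.uniform(0.01, 10.0, size=64)    # stand-in for Fisher-based scores
rank = 8

A_plain, B_plain = truncated_svd(W, rank)
A_w, B_w = row_weighted_svd(W, row_importance, rank)

print("parameters:", W.size, "->", A_w.size + B_w.size)
print("plain error:   SVD", np.sum((W - A_plain @ B_plain) ** 2),
      "| weighted", np.sum((W - A_w @ B_w) ** 2))
print("weighted error: SVD", weighted_error(W, A_plain @ B_plain, row_importance),
      "| weighted", weighted_error(W, A_w @ B_w, row_importance))
```

In this toy setting the weighted factorization has an unweighted reconstruction error no smaller than plain SVD's, yet a lower importance-weighted error, which mirrors the trade-off the abstract describes. Whether that translates into better task accuracy depends on how well the importance scores reflect the model's predictions, which this sketch does not test.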
