Paper Title
Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models
Paper Authors
Paper Abstract
Model compression by way of parameter pruning, quantization, or distillation has recently gained popularity as an approach for reducing the computational requirements of modern deep neural network models for NLP. Inspired by prior works suggesting a connection between simpler, more generalizable models and those that lie within wider loss basins, we hypothesize that optimizing for flat minima should lead to simpler parameterizations and thus more compressible models. We propose to combine sharpness-aware minimization (SAM) with various task-specific model compression methods, including iterative magnitude pruning (IMP), structured pruning with a distillation objective, and post-training dynamic quantization. Empirically, we show that optimizing for flatter minima consistently leads to greater compressibility of parameters compared to vanilla Adam when fine-tuning BERT models, with little to no loss in accuracy on the GLUE text classification and SQuAD question answering benchmarks. Moreover, SAM finds superior winning tickets during IMP that 1) are amenable to vanilla Adam optimization, and 2) transfer more effectively across tasks.
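To make the proposal concrete, below is a minimal sketch of the SAM two-step update wrapped around Adam, followed by post-training dynamic quantization and a single round of magnitude pruning. The tiny feed-forward model, the rho value, the random data, and the 30% pruning amount are illustrative assumptions; the paper's actual experiments fine-tune BERT on GLUE and SQuAD and use iterative magnitude pruning and structured pruning with distillation rather than this simplified setup.

```python
# Sketch only: SAM-style sharpness-aware update around Adam, then two example
# compression steps (dynamic quantization, magnitude pruning).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Assumed toy model in place of BERT.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
rho = 0.05  # assumed SAM neighborhood radius

def sam_step(x, y):
    # 1) First forward/backward pass: gradient at the current weights.
    loss_fn(model(x), y).backward()

    # 2) Ascend to the approximate worst-case point within the rho-ball.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            perturbations.append((p, e))
    optimizer.zero_grad()

    # 3) Second forward/backward pass at the perturbed weights.
    loss_fn(model(x), y).backward()

    # 4) Restore the original weights and apply the base optimizer update
    #    using the gradient computed at the perturbed point.
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()

# One illustrative update on random data.
x, y = torch.randn(8, 128), torch.randint(0, 2, (8,))
sam_step(x, y)

# Post-training dynamic quantization of the Linear layers (int8 weights).
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# One round of unstructured magnitude pruning (30% of weights), standing in
# for the paper's iterative magnitude pruning schedule.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
```

In the paper's framing, the compression steps are separate task-specific methods applied after (or interleaved with) SAM fine-tuning; the point of the sketch is only that SAM replaces the single Adam gradient step with a perturb-then-update pair, while the downstream pruning or quantization machinery is unchanged.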