Paper Title

Mixed-Precision Embedding Using a Cache

Paper Authors

Jie Amy Yang, Jianyu Huang, Jongsoo Park, Ping Tak Peter Tang, Andrew Tulloch

Paper Abstract

In recommendation systems, practitioners have observed that increasing the number and size of embedding tables often leads to significant improvements in model performance. Given this and the business importance of these models to major internet companies, embedding tables for personalization tasks have grown to terabyte scale and continue to grow at a significant rate. Meanwhile, these large-scale models are often trained on GPUs, where high-performance memory is a scarce resource, motivating numerous works on compressing embedding tables during training. We propose a novel change to embedding tables using a cache memory architecture, where the majority of rows in an embedding table are trained in low precision, and the most frequently or recently accessed rows are cached and trained in full precision. The proposed architectural change works in conjunction with standard precision-reduction and computer-arithmetic techniques such as quantization and stochastic rounding. For an open-source deep learning recommendation model (DLRM) running on the Criteo-Kaggle dataset, we achieve a 3x memory reduction with INT8-precision embedding tables and a full-precision cache whose size is 5% of the embedding tables, while maintaining accuracy. For an industrial-scale model and dataset, we achieve an even higher memory reduction of more than 7x with INT4 precision and a cache size of 1% of the embedding tables, while maintaining accuracy, as well as a 16% end-to-end training speedup by reducing GPU-to-host data transfers.
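To make the described mechanism concrete, here is a minimal PyTorch-style sketch of the idea in the abstract: most embedding rows are stored in a row-wise quantized INT8 table, a small full-precision cache holds recently accessed rows, and rows are written back to low precision with stochastic rounding on eviction. This is not the authors' implementation; the class and helper names (CachedInt8Embedding, quantize_row_int8, etc.), the FIFO eviction policy, and the cache-as-dict bookkeeping are illustrative assumptions.

```python
# Sketch only: illustrates INT8 row storage + full-precision cache + stochastic
# rounding, under assumptions noted above. Not the paper's actual implementation.
import torch


def quantize_row_int8(row: torch.Tensor, stochastic: bool = True):
    """Row-wise asymmetric INT8 quantization with optional stochastic rounding."""
    lo, hi = row.min(), row.max()
    scale = (hi - lo).clamp(min=1e-8) / 255.0
    x = (row - lo) / scale
    if stochastic:
        # Stochastic rounding: round up with probability equal to the fractional part.
        x = torch.floor(x + torch.rand_like(x))
    else:
        x = torch.round(x)
    q = x.clamp(0, 255).to(torch.uint8)
    return q, scale, lo


def dequantize_row_int8(q, scale, lo):
    return q.to(torch.float32) * scale + lo


class CachedInt8Embedding(torch.nn.Module):
    """Embedding table kept in INT8, with a small full-precision cache of hot rows."""

    def __init__(self, num_rows: int, dim: int, cache_fraction: float = 0.05):
        super().__init__()
        init = torch.randn(num_rows, dim) * 0.01
        qrows, scales, offsets = [], [], []
        for r in init:
            q, s, lo = quantize_row_int8(r, stochastic=False)
            qrows.append(q); scales.append(s); offsets.append(lo)
        self.register_buffer("q_table", torch.stack(qrows))
        self.register_buffer("scales", torch.stack(scales))
        self.register_buffer("offsets", torch.stack(offsets))
        self.cache_size = max(1, int(cache_fraction * num_rows))
        self.cache = {}  # row index -> full-precision row (trained while cached)

    def _fetch(self, idx: int) -> torch.Tensor:
        if idx not in self.cache:
            if len(self.cache) >= self.cache_size:
                self._evict()
            row = dequantize_row_int8(self.q_table[idx], self.scales[idx], self.offsets[idx])
            self.cache[idx] = row.clone().requires_grad_(True)
        return self.cache[idx]

    def _evict(self):
        # Simple FIFO eviction for brevity; the paper's cache keeps the most
        # frequently or recently accessed rows instead.
        idx, row = next(iter(self.cache.items()))
        q, s, lo = quantize_row_int8(row.detach(), stochastic=True)
        self.q_table[idx], self.scales[idx], self.offsets[idx] = q, s, lo
        del self.cache[idx]

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        rows = [self._fetch(int(i)) for i in indices.flatten()]
        return torch.stack(rows).view(*indices.shape, -1)
```

The sketch keeps the two key invariants from the abstract: only the small cached subset of rows is ever held (and updated) in full precision, and rows re-enter low precision through stochastic rounding rather than round-to-nearest, which is what lets training tolerate INT8/INT4 storage for the bulk of the table.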
