Paper Title

RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure

Paper Authors

Mark Zhao, Dhruv Choudhary, Devashish Tyagi, Ajay Somani, Max Kaplan, Sung-Han Lin, Sarunya Pumma, Jongsoo Park, Aarti Basant, Niket Agarwal, Carole-Jean Wu, Christos Kozyrakis

Paper Abstract

We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactions. While each user session can generate multiple training samples, many features' values do not change across these samples. We demonstrate how RecD exploits this property, end-to-end, across a deployed training pipeline. RecD optimizes data generation pipelines to decrease dataset storage and preprocessing resource demands and to maximize duplication within a training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors (IKJTs), to deduplicate feature values in each batch. We show how DLRM model architectures can leverage IKJTs to drastically increase training throughput. RecD improves the training and preprocessing throughput and storage efficiency by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
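
The paper's IKJT (InverseKeyedJaggedTensor) format and its actual implementation are not reproduced here. The snippet below is only a minimal PyTorch sketch of the underlying idea described in the abstract: within a training batch, a sparse feature's values are stored once as a unique set plus inverse indices, so embedding lookups run only on the unique values and are then expanded back to every sample. The helper name `dedup_feature` and all tensor values are illustrative assumptions, not the paper's API.

```python
import torch

def dedup_feature(values: torch.Tensor):
    """Deduplicate one sparse feature's values within a batch.

    Returns (unique_values, inverse_indices) such that
    unique_values[inverse_indices] reconstructs the original values.
    """
    unique_values, inverse_indices = torch.unique(values, return_inverse=True)
    return unique_values, inverse_indices

# Example: a batch of 8 samples drawn from a few user sessions; many
# samples repeat the same feature value (e.g., the same user id).
batch_values = torch.tensor([42, 42, 42, 7, 7, 42, 13, 7])
unique_values, inverse = dedup_feature(batch_values)

embedding = torch.nn.Embedding(num_embeddings=100, embedding_dim=4)

# Look up embeddings only for the 3 unique values, then expand to all
# 8 samples via the inverse indices.
unique_emb = embedding(unique_values)   # shape: [3, 4]
per_sample_emb = unique_emb[inverse]    # shape: [8, 4]

# Sanity check: the original batch is fully recoverable.
assert torch.equal(batch_values, unique_values[inverse])
```

In this toy batch, the embedding table is queried 3 times instead of 8; the abstract's reported training-throughput gains come from applying this kind of per-batch deduplication end-to-end across the deployed pipeline, not from this simplified gather.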
