Paper Title
Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference
Paper Authors
Paper Abstract
In this talk, we introduce Merlin HugeCTR. Merlin HugeCTR is an open-source, GPU-accelerated integrated framework for click-through rate estimation. It optimizes both training and inference, while enabling model training at scale with model-parallel embeddings and data-parallel neural networks. In particular, Merlin HugeCTR combines a high-performance GPU embedding cache with a hierarchical storage architecture to realize low-latency retrieval of embeddings for online model inference tasks. In the MLPerf v1.0 DLRM model training benchmark, Merlin HugeCTR achieves a speedup of up to 24.6x on a single DGX A100 (8x A100) over PyTorch on 4x4-socket CPU nodes (4x4x28 cores). Merlin HugeCTR can also leverage multi-node environments to accelerate training even further. Since late 2021, Merlin HugeCTR additionally features a hierarchical parameter server (HPS) and supports deployment via the NVIDIA Triton Inference Server framework, to harness the computational power of GPUs for high-speed recommendation model inference. With this HPS, Merlin HugeCTR users can achieve a 5x to 62x speedup (depending on batch size) for popular recommendation models over CPU baseline implementations, and dramatically reduce their end-to-end inference latency.
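The abstract mentions model-parallel embeddings: because recommendation embedding tables can exceed the memory of any single GPU, the table is partitioned across devices and each lookup is routed to the shard that owns the key. The following is a minimal Python sketch of that idea only; the class name `ShardedEmbedding`, the modulo placement rule, and the zero initialization are illustrative assumptions, not HugeCTR's actual distribution strategy or API.

```python
# Illustrative sketch of model-parallel embedding sharding (NOT HugeCTR's
# actual implementation): each simulated "GPU" owns the rows whose key maps
# to it, so the full table never has to fit on one device.

class ShardedEmbedding:
    def __init__(self, num_gpus, dim):
        self.num_gpus = num_gpus
        self.dim = dim
        # One shard per simulated GPU: key -> embedding vector.
        self.shards = [{} for _ in range(num_gpus)]

    def _owner(self, key):
        # A simple modulo placement stands in for a real
        # key-distribution strategy here.
        return key % self.num_gpus

    def lookup(self, keys):
        # Gather each vector from whichever shard owns its key,
        # lazily initializing unseen keys to zeros (placeholder init).
        out = []
        for key in keys:
            shard = self.shards[self._owner(key)]
            if key not in shard:
                shard[key] = [0.0] * self.dim
            out.append(shard[key])
        return out

emb = ShardedEmbedding(num_gpus=4, dim=8)
vectors = emb.lookup([3, 7, 42, 7])  # keys routed to shards 3, 3, 2, 3
```

In the real system the dense (MLP) part of the model is replicated data-parallel on every GPU, and an all-to-all exchange moves the looked-up embedding vectors to the GPUs that need them for their local mini-batch slice.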
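The hierarchical parameter server described above keeps the hottest embeddings in a small, fast GPU cache backed by larger, slower tiers. The sketch below illustrates only that tiered-lookup principle with an LRU cache in front of a single backing store; the class name, two-tier layout, and eviction policy are assumptions for illustration, not the actual HPS design.

```python
# Hedged sketch of tiered embedding retrieval, loosely inspired by the
# hierarchical parameter server idea: hot keys are served from a small
# "GPU cache", and misses pull vectors up from a larger backing store
# (standing in for CPU memory / SSD tiers).

from collections import OrderedDict

class TieredEmbeddingStore:
    def __init__(self, gpu_capacity, full_table):
        self.gpu_cache = OrderedDict()   # smallest, fastest tier (LRU order)
        self.gpu_capacity = gpu_capacity
        self.full_table = full_table     # simulated slow backing store

    def get(self, key):
        if key in self.gpu_cache:
            self.gpu_cache.move_to_end(key)      # mark as recently used
            return self.gpu_cache[key]
        vec = self.full_table[key]               # miss: fetch from lower tier
        self.gpu_cache[key] = vec
        if len(self.gpu_cache) > self.gpu_capacity:
            self.gpu_cache.popitem(last=False)   # evict least recently used
        return vec

table = {k: [float(k)] * 4 for k in range(100)}
store = TieredEmbeddingStore(gpu_capacity=2, full_table=table)
store.get(1); store.get(2); store.get(1); store.get(3)  # key 2 is evicted
```

Because embedding access in click-through-rate workloads is highly skewed toward popular items, even a small cache like this absorbs most lookups, which is what makes the reported low-latency online inference possible.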