Paper Title
Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference
Paper Authors
Paper Abstract
In this talk, we introduce Merlin HugeCTR. Merlin HugeCTR is an open-source, GPU-accelerated integrated framework for click-through rate estimation. It optimizes both training and inference, while enabling model training at scale with model-parallel embeddings and data-parallel neural networks. In particular, Merlin HugeCTR combines a high-performance GPU embedding cache with a hierarchical storage architecture to realize low-latency retrieval of embeddings for online model inference tasks. In the MLPerf v1.0 DLRM model training benchmark, Merlin HugeCTR achieves a speedup of up to 24.6x on a single DGX A100 (8x A100) over PyTorch on 4x4-socket CPU nodes (4x4x28 cores). Merlin HugeCTR can also leverage multi-node environments to accelerate training even further. Since late 2021, Merlin HugeCTR additionally features a hierarchical parameter server (HPS) and supports deployment via the NVIDIA Triton Inference Server framework, to harness the computational power of GPUs for high-speed recommendation model inference. With this HPS, Merlin HugeCTR users can achieve a 5x to 62x speedup (depending on batch size) for popular recommendation models over CPU baseline implementations, and dramatically reduce their end-to-end inference latency.
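The abstract mentions model-parallel embeddings: because recommendation embedding tables can exceed the memory of any single GPU, the table is partitioned across devices and each lookup is routed to the shard that owns the key. The following is a minimal Python sketch of that idea only; the class name `ShardedEmbedding`, the modulo placement rule, and the zero initialization are illustrative assumptions, not HugeCTR's actual distribution strategy or API.

```python
# Illustrative sketch of model-parallel embedding sharding (NOT HugeCTR's
# actual implementation): each simulated "GPU" owns the rows whose key maps
# to it, so the full table never has to fit on one device.

class ShardedEmbedding:
    def __init__(self, num_gpus, dim):
        self.num_gpus = num_gpus
        self.dim = dim
        # One shard per simulated GPU: key -> embedding vector.
        self.shards = [{} for _ in range(num_gpus)]

    def _owner(self, key):
        # A simple modulo placement stands in for a real
        # key-distribution strategy here.
        return key % self.num_gpus

    def lookup(self, keys):
        # Gather each vector from whichever shard owns its key,
        # lazily initializing unseen keys to zeros (placeholder init).
        out = []
        for key in keys:
            shard = self.shards[self._owner(key)]
            if key not in shard:
                shard[key] = [0.0] * self.dim
            out.append(shard[key])
        return out

emb = ShardedEmbedding(num_gpus=4, dim=8)
vectors = emb.lookup([3, 7, 42, 7])  # keys routed to shards 3, 3, 2, 3
```

In the real system the dense (MLP) part of the model is replicated data-parallel on every GPU, and an all-to-all exchange moves the looked-up embedding vectors to the GPUs that need them for their local mini-batch slice.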
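The hierarchical parameter server described above keeps the hottest embeddings in a small, fast GPU cache backed by larger, slower tiers. The sketch below illustrates only that tiered-lookup principle with an LRU cache in front of a single backing store; the class name, two-tier layout, and eviction policy are assumptions for illustration, not the actual HPS design.

```python
# Hedged sketch of tiered embedding retrieval, loosely inspired by the
# hierarchical parameter server idea: hot keys are served from a small
# "GPU cache", and misses pull vectors up from a larger backing store
# (standing in for CPU memory / SSD tiers).

from collections import OrderedDict

class TieredEmbeddingStore:
    def __init__(self, gpu_capacity, full_table):
        self.gpu_cache = OrderedDict()   # smallest, fastest tier (LRU order)
        self.gpu_capacity = gpu_capacity
        self.full_table = full_table     # simulated slow backing store

    def get(self, key):
        if key in self.gpu_cache:
            self.gpu_cache.move_to_end(key)      # mark as recently used
            return self.gpu_cache[key]
        vec = self.full_table[key]               # miss: fetch from lower tier
        self.gpu_cache[key] = vec
        if len(self.gpu_cache) > self.gpu_capacity:
            self.gpu_cache.popitem(last=False)   # evict least recently used
        return vec

table = {k: [float(k)] * 4 for k in range(100)}
store = TieredEmbeddingStore(gpu_capacity=2, full_table=table)
store.get(1); store.get(2); store.get(1); store.get(3)  # key 2 is evicted
```

Because embedding access in click-through-rate workloads is highly skewed toward popular items, even a small cache like this absorbs most lookups, which is what makes the reported low-latency online inference possible.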