Paper Title
Operation-Level Performance Benchmarking of Graph Neural Networks for Scientific Applications
Paper Authors
Paper Abstract
As Graph Neural Networks (GNNs) grow in popularity for scientific machine learning, their training and inference efficiency is becoming increasingly critical. Additionally, the deep learning field as a whole is trending towards wider and deeper networks and ever-increasing data sizes, to the point where hardware bottlenecks are often encountered. Emerging specialty hardware platforms provide an exciting solution to this problem. In this paper, we systematically profile and select low-level operations pertinent to GNNs for scientific computing, as implemented in the PyTorch Geometric software framework. These are then rigorously benchmarked on NVIDIA A100 GPUs across several combinations of input values, including tensor sparsity. We then analyze these results for each operation. At a high level, we conclude that on NVIDIA systems: (1) confounding bottlenecks such as memory inefficiency often dominate runtime costs more so than data sparsity alone, (2) native PyTorch operations are often as competitive as, or more competitive than, their PyTorch Geometric equivalents, especially at low to moderate levels of input data sparsity, and (3) many operations central to state-of-the-art GNN architectures have little to no optimization for sparsity. We hope that these results serve as a baseline for those developing these operations on specialized hardware, and that our subsequent analysis helps to facilitate future software- and hardware-based optimizations of these operations, and thus scalable GNN performance as a whole.
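The abstract's methodology, sweeping an operation across input-sparsity levels and timing a dense kernel against a sparsity-aware one, can be illustrated with a minimal, hypothetical sketch. This is not the paper's harness (which uses PyTorch Geometric on A100 GPUs); it is a pure-Python stand-in that benchmarks a dense matrix-vector product against a CSR-style sparse one, where the function names and the best-of-repeats timing loop are illustrative assumptions:

```python
import random
import time

def dense_matvec(A, x):
    # Naive dense product: touches every entry regardless of sparsity.
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def sparse_matvec(rows, x):
    # Sparsity-aware product: each row stored as (col, value) pairs,
    # so work scales with the number of nonzeros.
    return [sum(v * x[c] for c, v in row) for row in rows]

def make_matrix(n, sparsity, rng):
    # Random n x n matrix with roughly `sparsity` fraction of zero entries,
    # plus its (col, value) nonzero representation.
    A = [[0.0 if rng.random() < sparsity else rng.random() for _ in range(n)]
         for _ in range(n)]
    nz = [[(c, v) for c, v in enumerate(row) if v != 0.0] for row in A]
    return A, nz

def bench(fn, arg, x, repeats=5):
    # Best-of-repeats wall-clock time, a common simple benchmarking loop.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(arg, x)
        best = min(best, time.perf_counter() - t0)
    return best

rng = random.Random(0)
n = 300
x = [rng.random() for _ in range(n)]
results = {}
for sparsity in (0.0, 0.5, 0.9, 0.99):
    A, nz = make_matrix(n, sparsity, rng)
    results[sparsity] = (bench(dense_matvec, A, x),
                        bench(sparse_matvec, nz, x))
    d, s = results[sparsity]
    print(f"sparsity={sparsity:.2f}  dense={d:.5f}s  sparse={s:.5f}s")
```

The dense timing is roughly flat across sparsity levels while the sparse timing falls with the nonzero count, which mirrors the paper's framing: a kernel with no sparsity optimization pays the full dense cost even on highly sparse inputs.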