High Performance and Portable Convolution Operators for ARM-based Multicore Processors

Paper title

High Performance and Portable Convolution Operators for ARM-based Multicore Processors

Authors

Pablo San Juan, Adrián Castelló, Manuel F. Dolz, Pedro Alonso-Jordá, Enrique S. Quintana-Ortí

Abstract

The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high performance algorithms for the convolution operator present in this type of network. One of these approaches leverages the IM2COL transform followed by a general matrix multiplication (GEMM) in order to take advantage of the highly optimized realizations of the GEMM kernel in many linear algebra libraries. The main problems of this approach are 1) the large memory workspace required to host the intermediate matrices generated by the IM2COL transform; and 2) the time to perform the IM2COL transform, which is not negligible for complex neural networks. This paper presents a portable high performance convolution algorithm based on the BLIS realization of the GEMM kernel that avoids the use of the intermediate memory by taking advantage of the BLIS structure. In addition, the proposed algorithm eliminates the cost of the explicit IM2COL transform, while maintaining the portability and performance of the underlying realization of GEMM in BLIS.
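
To make the baseline concrete, the following is a minimal sketch of the explicit IM2COL-plus-GEMM approach that the abstract describes as the starting point. It is not the paper's algorithm and does not use BLIS. It assumes a single-channel image, stride 1, no padding, and the cross-correlation convention common in CNNs, and all function names (`im2col`, `gemm`, `conv_im2col`, `conv_direct`) are hypothetical. The naive `gemm` stands in for an optimized library kernel.

```python
def im2col(image, kh, kw):
    """Unroll a 2D image (list of rows) into a patch matrix.

    Each row of the result corresponds to one kernel offset (i, j);
    each column corresponds to one output pixel (y, x).
    Assumes stride 1 and no padding.
    """
    h, w = len(image), len(image[0])
    oh, ow = h - kh + 1, w - kw + 1
    cols = [
        [image[y + i][x + j] for y in range(oh) for x in range(ow)]
        for i in range(kh)
        for j in range(kw)
    ]
    return cols, oh, ow


def gemm(a, b):
    """Naive matrix product; a stand-in for an optimized GEMM kernel."""
    inner, n = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(n)]
        for row in a
    ]


def conv_im2col(image, kernel):
    """Convolution as an explicit IM2COL transform followed by one GEMM."""
    kh, kw = len(kernel), len(kernel[0])
    cols, oh, ow = im2col(image, kh, kw)            # intermediate workspace
    flat_kernel = [[v for row in kernel for v in row]]  # 1 x (kh*kw) matrix
    out = gemm(flat_kernel, cols)[0]                # 1 x (oh*ow) result
    return [out[y * ow:(y + 1) * ow] for y in range(oh)]


def conv_direct(image, kernel):
    """Reference direct convolution with no intermediate matrix."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(ow)
        ]
        for y in range(oh)
    ]


if __name__ == "__main__":
    image = [[4 * r + c for c in range(4)] for r in range(4)]
    kernel = [[1, 0], [0, 1]]
    cols, _, _ = im2col(image, 2, 2)
    print("input elements:", 16, "workspace elements:", len(cols) * len(cols[0]))
    print(conv_im2col(image, kernel))
```

Even in this toy case the patch matrix holds 36 elements for a 16-element input, which illustrates the workspace overhead the abstract refers to; the explicit copy loop in `im2col` corresponds to the transform time the paper aims to eliminate.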
