Paper Title
Number of Attention Heads vs Number of Transformer-Encoders in Computer Vision
Paper Authors
Paper Abstract
Determining an appropriate number of attention heads on the one hand, and the number of transformer-encoders on the other, is an important choice for Computer Vision (CV) tasks using the Transformer architecture. Computational experiments confirmed the expectation that the total number of parameters has to satisfy the condition of overdetermination (i.e., the number of constraints significantly exceeds the number of parameters); only then can good generalization performance be expected. This sets the boundaries within which the number of heads and the number of transformer-encoders can be chosen. If the role of context in the images to be classified can be assumed to be small, it is favorable to use multiple transformer-encoders with a low number of heads (such as one or two). In classifying objects whose class may depend heavily on the context within the image (i.e., the meaning of a patch depends on other patches), the number of heads is as important as the number of transformer-encoders.
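To make the overdetermination condition concrete, the sketch below counts the trainable parameters of a stack of transformer-encoders and compares them with the number of constraints (training samples times output values). This is a minimal illustration, not the paper's exact accounting: the embedding size, per-head dimension, MLP ratio, and dataset sizes are all assumed for the example.

```python
# Sketch: parameter count of a transformer-encoder stack vs. the number of
# constraints (samples x outputs). Ratios well above 1 indicate the
# overdetermined regime the abstract refers to. All dimensions are assumptions.

def encoder_params(d_model: int, num_heads: int, d_head: int, mlp_ratio: int = 4) -> int:
    """Trainable parameters (weights + biases) of one transformer-encoder layer."""
    # Per-head Q/K/V projections: d_model -> d_head each, with biases.
    qkv = num_heads * 3 * (d_model * d_head + d_head)
    # Output projection: concatenated heads (num_heads * d_head) -> d_model.
    proj = num_heads * d_head * d_model + d_model
    # Two-layer feed-forward block: d_model -> mlp_ratio*d_model -> d_model.
    d_ff = mlp_ratio * d_model
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector of size d_model.
    norms = 2 * 2 * d_model
    return qkv + proj + mlp + norms

def overdetermination_ratio(n_samples: int, n_outputs: int, n_params: int) -> float:
    """Constraints per parameter; > 1 means more constraints than parameters."""
    return (n_samples * n_outputs) / n_params

if __name__ == "__main__":
    d_model, d_head = 256, 64            # assumed embedding and per-head sizes
    n_samples, n_classes = 50_000, 10    # e.g., a CIFAR-10-sized training set
    for num_heads in (1, 2, 4, 8):
        for num_encoders in (1, 2, 4, 8):
            params = num_encoders * encoder_params(d_model, num_heads, d_head)
            ratio = overdetermination_ratio(n_samples, n_classes, params)
            print(f"heads={num_heads} encoders={num_encoders} "
                  f"params={params:,} constraints/params={ratio:.2f}")
```

Under these assumptions, both adding heads (at a fixed per-head size) and adding encoder layers grow the parameter count, so the constraints-per-parameter ratio shrinks along either axis; the configurations that remain overdetermined delimit the admissible head/encoder combinations.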