Persistence Initialization: A novel adaptation of the Transformer architecture for Time Series Forecasting

Paper title

Persistence Initialization: A novel adaptation of the Transformer architecture for Time Series Forecasting

Authors

Espen Haugsdal, Erlend Aune, Massimiliano Ruocco

Abstract

Time series forecasting is an important problem with many real-world applications. Ensembles of deep neural networks have recently achieved impressive forecasting accuracy, but such large ensembles are impractical in many real-world settings. Transformer models have been successfully applied to a diverse set of challenging problems. We propose a novel adaptation of the original Transformer architecture for the task of time series forecasting, called Persistence Initialization. The model is initialized as a naive persistence model by using a multiplicative gating mechanism combined with a residual skip connection. We use a decoder Transformer with ReZero normalization and Rotary positional encodings, but the adaptation is applicable to any auto-regressive neural network model. We evaluate our proposed architecture on the challenging M4 dataset, achieving competitive performance compared to ensemble-based methods. We also compare against recently proposed Transformer models for time series forecasting, showing superior performance on the M4 dataset. Extensive ablation studies show that Persistence Initialization leads to better performance and faster convergence. As the size of the model increases, only the models with our proposed adaptation gain in performance. We also perform an additional ablation study to determine the importance of the choice of normalization and positional encoding, and find both the use of Rotary encodings and ReZero normalization to be essential for good forecasting performance.
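
The gating idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `GatedResidualForecaster`, the tiny `tanh` network standing in for the Transformer, and the zero initial value of the gate are all assumptions made for illustration. The sketch only shows how a multiplicative gate on the network output, added to a skip connection from the last observed value, makes the untrained model behave as a naive persistence forecaster.

```python
import numpy as np


def persistence_forecast(history):
    """Naive persistence: the next value is predicted to equal the last observed one."""
    return history[-1]


class GatedResidualForecaster:
    """Hypothetical one-step forecaster: last value + gate * network(history)."""

    def __init__(self, window, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        # Small random network used as a stand-in for the Transformer (assumption).
        self.w1 = rng.normal(scale=0.5, size=(window, hidden))
        self.w2 = rng.normal(scale=0.5, size=(hidden,))
        # Multiplicative gate, assumed to start at zero so that the initial
        # forecast equals the persistence forecast.
        self.gate = 0.0
        self.window = window

    def inner(self, history):
        # Output of the stand-in network on the most recent `window` values.
        x = np.asarray(history[-self.window:], dtype=float)
        return float(np.tanh(x @ self.w1) @ self.w2)

    def predict(self, history):
        # Residual skip connection from the last observation plus the gated output.
        return history[-1] + self.gate * self.inner(history)


series = [1.0, 2.0, 3.0, 5.0]
model = GatedResidualForecaster(window=4)
initial = model.predict(series)   # equals the persistence forecast while gate == 0
model.gate = 0.1                  # stands in for a gate value learned in training
adjusted = model.predict(series)  # now deviates from the persistence forecast
```

In this sketch, training would move `gate` away from zero only as far as the network's correction improves on the persistence baseline.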
