Fuse It More Deeply! A Variational Transformer with Layer-Wise Latent Variable Inference for Text Generation

Paper title

Fuse It More Deeply! A Variational Transformer with Layer-Wise Latent Variable Inference for Text Generation

Authors

Jinyi Hu, Xiaoyuan Yi, Wenhao Li, Maosong Sun, Xing Xie

Abstract

The past several years have witnessed Variational Auto-Encoder's superiority in various text generation tasks. However, due to the sequential nature of the text, auto-regressive decoders tend to ignore latent variables and then reduce to simple language models, known as the KL vanishing problem, which would further deteriorate when VAE is combined with Transformer-based structures. To ameliorate this problem, we propose DELLA, a novel variational Transformer framework. DELLA learns a series of layer-wise latent variables with each inferred from those of lower layers and tightly coupled with the hidden states by low-rank tensor product. In this way, DELLA forces these posterior latent variables to be fused deeply with the whole computation path and hence incorporate more information. We theoretically demonstrate that our method can be regarded as entangling latent variables to avoid posterior information decrease through layers, enabling DELLA to get higher non-zero KL values even without any annealing or thresholding tricks. Experiments on four unconditional and three conditional generation tasks show that DELLA could better alleviate KL vanishing and improve both quality and diversity compared to several strong baselines.
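
To make the KL vanishing problem in the abstract concrete, here is a minimal sketch of how a summed, layer-wise KL term could be computed for diagonal Gaussian latent variables. It is an illustration under stated assumptions, not the authors' implementation: the function names `gaussian_kl` and `layerwise_kl`, the diagonal Gaussian form, and the fixed standard-normal prior at every layer are all assumed here. The abstract only says that each layer's latent variable is inferred from those of lower layers.

```python
import math

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians given as lists of means and log-variances."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Closed-form KL for one dimension, summed over all dimensions.
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

def layerwise_kl(posteriors, priors):
    """Sum per-layer KL terms; each list entry is a (mu, logvar) pair for one layer."""
    return sum(
        gaussian_kl(mq, lq, mp, lp)
        for (mq, lq), (mp, lp) in zip(posteriors, priors)
    )

# Hypothetical 3-layer setup with 2-dimensional latents (assumed values).
prior = ([0.0, 0.0], [0.0, 0.0])           # standard normal at every layer
priors = [prior] * 3

collapsed = [prior] * 3                    # posterior equals prior: KL has vanished
informative = [([1.0, -1.0], [0.0, 0.0])] * 3  # posterior means shifted away from the prior

kl_collapsed = layerwise_kl(collapsed, priors)      # 0.0
kl_informative = layerwise_kl(informative, priors)  # 3.0
```

When the posterior matches the prior at every layer, the total KL is zero and the decoder receives no information from the latent variables, which is the collapse the abstract describes. A non-zero sum means at least some layers carry posterior information.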
