Paper Title
HashFormers: Towards Vocabulary-independent Pre-trained Transformers
Paper Authors
Paper Abstract
Transformer-based pre-trained language models are vocabulary-dependent, mapping by default each token to its corresponding embedding. This one-to-one mapping results in embedding matrices that occupy a lot of memory (i.e. millions of parameters) and grow linearly with the size of the vocabulary. Previous work on on-device transformers dynamically generates token embeddings on-the-fly without embedding matrices, using locality-sensitive hashing over morphological information. These embeddings are subsequently fed into transformer layers for text classification. However, these methods are not pre-trained. Inspired by this line of work, we propose HashFormers, a new family of vocabulary-independent pre-trained transformers that support an unlimited vocabulary (i.e. all possible tokens in a corpus) given a substantially smaller fixed-size embedding matrix. We achieve this by first introducing computationally cheap hashing functions that bucket individual tokens together into embeddings. We also propose three variants that do not require an embedding matrix at all, further reducing the memory requirements. We empirically demonstrate that HashFormers are more memory efficient than standard pre-trained transformers while achieving comparable predictive performance when fine-tuned on multiple text classification tasks. For example, our most efficient HashFormer variant shows negligible performance degradation (0.4% on GLUE) while using only 99.1K parameters to represent the embeddings, compared to the 12.3-38M parameters of state-of-the-art models.
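To make the bucketing idea concrete, below is a minimal PyTorch sketch of a hashed embedding layer: each token string is hashed into one of a small, fixed number of buckets, so the size of the embedding matrix no longer depends on the vocabulary and unseen tokens can still be embedded. The class name, bucket count, and the choice of MD5 as the hash function are illustrative assumptions and not the paper's exact method, which also includes variants that avoid an embedding matrix altogether.

```python
import hashlib

import torch
import torch.nn as nn


class HashedEmbedding(nn.Module):
    """Illustrative sketch (not the paper's exact scheme): map an unbounded
    token vocabulary to a small, fixed-size embedding matrix by hashing each
    token string into one of `num_buckets` rows. Colliding tokens share an
    embedding, which is what keeps the parameter count vocabulary-independent."""

    def __init__(self, num_buckets: int = 5000, dim: int = 768):
        super().__init__()
        # The embedding matrix has num_buckets rows regardless of vocabulary size.
        self.num_buckets = num_buckets
        self.embedding = nn.Embedding(num_buckets, dim)

    def bucket_id(self, token: str) -> int:
        # A cheap, deterministic hash of the token surface form.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_buckets

    def forward(self, tokens: list[str]) -> torch.Tensor:
        ids = torch.tensor([self.bucket_id(t) for t in tokens])
        return self.embedding(ids)  # shape: (len(tokens), dim)


if __name__ == "__main__":
    emb = HashedEmbedding(num_buckets=5000, dim=768)
    vectors = emb(["hash", "formers", "completely-unseen-token"])
    print(vectors.shape)  # torch.Size([3, 768])
```

In this sketch, 5000 buckets of 768-dimensional vectors correspond to roughly 3.8M embedding parameters, already far fewer than a full BERT-style vocabulary of 30K+ tokens would require; the paper's most efficient variants go further by dropping the learned matrix entirely.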