Paper Title
Layer or Representation Space: What makes BERT-based Evaluation Metrics Robust?
Paper Authors
Paper Abstract
The evaluation of recent embedding-based evaluation metrics for text generation is primarily based on measuring their correlation with human evaluations on standard benchmarks. However, these benchmarks are mostly from similar domains to those used for pretraining word embeddings. This raises concerns about the (lack of) generalization of embedding-based metrics to new and noisy domains that contain a different vocabulary than the pretraining data. In this paper, we examine the robustness of BERTScore, one of the most popular embedding-based metrics for text generation. We show that (a) an embedding-based metric that has the highest correlation with human evaluations on a standard benchmark can have the lowest correlation if the amount of input noise or unknown tokens increases, (b) taking embeddings from the first layer of pretrained models improves the robustness of all metrics, and (c) the highest robustness is achieved when using character-level embeddings, instead of token-based embeddings, from the first layer of the pretrained model.
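For illustration, here is a minimal sketch (not the authors' released code) of how finding (b) can be tried out with the open-source `bert_score` Python package, which exposes a `num_layers` argument for choosing which transformer layer the embeddings are taken from. The candidate and reference sentences below are hypothetical examples.

```python
# Minimal sketch: compute BERTScore using embeddings from the FIRST layer
# of a pretrained model, instead of the default layer that is tuned for
# correlation with human judgments on standard benchmarks.
from bert_score import score

cands = ["The model produces fluent translations."]   # hypothetical system output
refs = ["The system generates fluent translations."]  # hypothetical reference

# num_layers=1 selects the first transformer layer; the paper reports that
# first-layer embeddings improve robustness to noisy/unknown-token inputs.
P, R, F1 = score(cands, refs, model_type="bert-base-uncased", num_layers=1)
print(f"P={P.item():.4f}  R={R.item():.4f}  F1={F1.item():.4f}")
```

Whether first-layer embeddings also hold up for character-level models, as in finding (c), depends on the specific pretrained model used; the snippet above only demonstrates the layer-selection mechanism for a token-based model.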