Paper Title


Learning to Compare for Better Training and Evaluation of Open Domain Natural Language Generation Models

Paper Authors

Wangchunshu Zhou, Ke Xu

Paper Abstract


Automated evaluation of open domain natural language generation (NLG) models remains a challenge, and widely used metrics such as BLEU and perplexity can be misleading in some cases. In our paper, we propose to evaluate natural language generation models by learning to compare a pair of generated sentences with a fine-tuned BERT, which has been shown to have good natural language understanding ability. We also propose to evaluate the model-level quality of NLG models by aggregating sample-level comparison results with a skill rating system. While it can be trained in a fully self-supervised fashion, our model can be further fine-tuned with a small amount of human preference annotations to better imitate human judgment. In addition to evaluating trained models, we propose to apply our model as a performance indicator during training for better hyperparameter tuning and early stopping. We evaluate our approach on both story generation and chit-chat dialogue response generation. Experimental results show that our model correlates better with human preference than previous automated evaluation approaches. Training with the proposed metric yields better performance in human evaluation, which further demonstrates the effectiveness of the proposed model.
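As a rough illustration of the two components the abstract describes (a BERT-based pairwise comparator and a skill-rating aggregation of sample-level outcomes), the Python sketch below shows how a pairwise judgment could feed an Elo-style rating update. The three-way label scheme, the `compare`/`elo_update` names, and the Elo constants are illustrative assumptions, not the authors' implementation; the paper's actual fine-tuning procedure and skill-rating details are not reproduced here.

```python
# Minimal sketch (not the authors' code) of:
# (1) a BERT model used as a pairwise comparator of generated sentences, and
# (2) an Elo-style skill-rating update aggregating sample-level comparisons
#     into a model-level score.
# The comparator below uses a freshly initialized 3-way classification head;
# in practice it would first be fine-tuned (self-supervised, optionally with
# human preference annotations) before its judgments are trusted.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Assumed 3-way head: sentence A better, sentence B better, or tie.
comparator = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)
comparator.eval()


def compare(sent_a: str, sent_b: str) -> int:
    """Return 0 if sent_a is judged better, 1 if sent_b is judged better, 2 for a tie."""
    inputs = tokenizer(sent_a, sent_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = comparator(**inputs).logits  # shape: (1, 3)
    return int(logits.argmax(dim=-1))


def elo_update(rating_a: float, rating_b: float, outcome: float, k: float = 16.0):
    """Standard Elo update; outcome is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (outcome - expected_a)
    return rating_a + delta, rating_b - delta


# Usage: compare one sample from each of two NLG models, then update their ratings.
verdict = compare("Once upon a time, a girl found a key.", "a girl found key once upon time")
outcome = {0: 1.0, 1: 0.0, 2: 0.5}[verdict]
rating_a, rating_b = elo_update(1500.0, 1500.0, outcome)
```

Repeating such comparisons over many sampled pairs and accumulating the rating updates is one plausible way to obtain the model-level score the abstract refers to.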
