Paper Title

QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation

Authors

Tianbo Ji, Chenyang Lyu, Gareth Jones, Liting Zhou, Yvette Graham

Abstract

Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers found within the passage. In recent years, the introduction of neural generation models has resulted in substantial improvements of automatically generated questions in terms of quality, especially compared to traditional approaches that employ manually crafted heuristics. However, the metrics commonly applied in QG evaluations have been criticized for their low agreement with human judgement. We therefore propose a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems, called QAScore. Instead of fine-tuning a language model to maximize its correlation with human judgements, QAScore evaluates a question by computing the cross entropy according to the probability that the language model can correctly generate the masked words in the answer to that question. Furthermore, we conduct a new crowd-sourcing human evaluation experiment for the QG evaluation to investigate how QAScore and other metrics can correlate with human judgements. Experiments show that QAScore obtains a stronger correlation with the results of our proposed human evaluation method compared to existing traditional word-overlap-based metrics such as BLEU and ROUGE, as well as the existing pretrained-model-based metric BERTScore.
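The abstract describes the core mechanism only at a high level: QAScore scores a question by how well a pretrained language model can recover masked words of the answer, measured via cross entropy. The sketch below illustrates that idea under stated assumptions; the model choice (roberta-base), the input format (passage and question concatenated before the answer), and the aggregation (mean log-probability over answer tokens) are illustrative guesses, not the authors' exact implementation.

```python
# Minimal sketch of the idea described in the abstract: score a generated
# question by how well a pretrained masked language model recovers masked
# answer words given the passage and the candidate question.
# Assumptions (not from the paper): roberta-base as the LM, simple
# concatenation of passage + question as context, mean log-probability
# of answer tokens as the final score.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def qascore_sketch(passage: str, question: str, answer: str) -> float:
    """Mean log-probability of answer tokens, each predicted while masked,
    conditioned on the passage and the candidate question."""
    context_ids = tokenizer.encode(passage + " " + question, add_special_tokens=False)
    answer_ids = tokenizer.encode(answer, add_special_tokens=False)
    log_probs = []
    for i, gold_id in enumerate(answer_ids):
        # Mask the i-th answer token; the rest of the answer stays visible.
        masked_answer = list(answer_ids)
        masked_answer[i] = tokenizer.mask_token_id
        input_ids = torch.tensor([[tokenizer.cls_token_id] + context_ids +
                                  [tokenizer.sep_token_id] + masked_answer +
                                  [tokenizer.sep_token_id]])
        mask_pos = 1 + len(context_ids) + 1 + i  # position of the masked token
        with torch.no_grad():
            logits = model(input_ids).logits
        token_log_probs = torch.log_softmax(logits[0, mask_pos], dim=-1)
        log_probs.append(token_log_probs[gold_id].item())
    # Higher (less negative) values mean the question makes the answer
    # easier for the language model to recover.
    return sum(log_probs) / len(log_probs)

# Toy usage:
# score = qascore_sketch("The Eiffel Tower is in Paris.",
#                        "Where is the Eiffel Tower located?",
#                        "Paris")
```

Because the score is the negative of a cross-entropy-style quantity computed directly from a frozen language model, no fine-tuning on human ratings or reference questions is needed, which is what makes the metric both unsupervised and reference-free.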
