Paper Title
RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question
Paper Authors
Paper Abstract
Existing metrics for evaluating the quality of automatically generated questions, such as BLEU, ROUGE, BERTScore, and BLEURT, compare the reference and predicted questions, providing a high score when there is considerable lexical overlap or semantic similarity between the candidate and the reference questions. This approach has two major shortcomings. First, we need expensive human-provided reference questions. Second, it penalises valid questions that may not have high lexical or semantic similarity to the reference questions. In this paper, we propose a new metric, RQUGE, based on the answerability of the candidate question given the context. The metric consists of a question-answering module and a span scorer module, both using pre-trained models from the existing literature; thus, it can be used without any further training. We demonstrate that RQUGE has a higher correlation with human judgment without relying on the reference question. Additionally, RQUGE is shown to be more robust to several adversarial corruptions. Furthermore, we illustrate that we can significantly improve the performance of QA models on out-of-domain datasets by fine-tuning on synthetic data generated by a question generation model and re-ranked by RQUGE.
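To make the described pipeline concrete, below is a minimal sketch of an RQUGE-style answerability score. It is not the authors' released implementation: the extractive QA pipeline and the sentence-similarity model used as a stand-in span scorer (deepset/roberta-base-squad2 and all-MiniLM-L6-v2) are assumptions chosen only for illustration, and the function name rquge_style_score is hypothetical.

```python
# Sketch of an answerability-based score in the spirit of RQUGE (illustrative only).
# Assumptions: a generic extractive QA pipeline stands in for the paper's QA module,
# and cosine similarity of sentence embeddings stands in for its learned span scorer.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")
span_scorer = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in span scorer

def rquge_style_score(context: str, candidate_question: str, gold_answer: str) -> float:
    """Answer the candidate question from the context, then compare the predicted
    answer span with the gold answer. Higher means the question is more answerable
    with respect to the intended answer."""
    predicted = qa_model(question=candidate_question, context=context)["answer"]
    embeddings = span_scorer.encode([predicted, gold_answer], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

context = "RQUGE evaluates a generated question by answering it over the context."
print(rquge_style_score(context,
                        "How does RQUGE evaluate a generated question?",
                        "by answering it over the context"))
```

Because both components are pre-trained off-the-shelf models, such a score needs no reference question and no further training, which is the key property the abstract highlights.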