Paper Title

RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering

Paper Authors

Victor Zhong, Weijia Shi, Wen-tau Yih, Luke Zettlemoyer

Paper Abstract

We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA). RoMQA contains clusters of questions that are derived from related constraints mined from the Wikidata knowledge graph. RoMQA evaluates robustness of QA models to varying constraints by measuring worst-case performance within each question cluster. Compared to prior QA datasets, RoMQA has more human-written questions that require reasoning over more evidence text and have, on average, many more correct answers. In addition, human annotators rate RoMQA questions as more natural or likely to be asked by people. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging: zero-shot and few-shot models perform similarly to naive baselines, while supervised retrieval methods perform well below gold evidence upper bounds. Moreover, existing models are not robust to variations in question constraints, but can be made more robust by tuning on clusters of related questions. Our results show that RoMQA is a challenging benchmark for large language models, and provides a quantifiable test to build more robust QA methods.
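To make the evaluation protocol concrete, below is a minimal sketch of how worst-case performance within a question cluster can be computed. This is not the paper's official evaluation code; the set-level F1 as the per-question metric and all function names are assumptions for illustration.

# A minimal sketch (not the official RoMQA evaluation code) of worst-case
# scoring within a question cluster, as described in the abstract.
# The per-question set-level F1 metric is an assumption.

def answer_set_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 between a predicted answer set and the gold answer set."""
    if not predicted or not gold:
        return float(predicted == gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def cluster_worst_case(per_question_scores: list[float]) -> float:
    """Robustness score for one cluster: the worst per-question score."""
    return min(per_question_scores)

# Example: a model that answers two of three related questions well
# still receives a low robustness score for the cluster.
scores = [
    answer_set_f1({"Paris", "Lyon"}, {"Paris", "Lyon"}),  # 1.0
    answer_set_f1({"Paris"}, {"Paris", "Lyon"}),          # ~0.67
    answer_set_f1({"Berlin"}, {"Paris", "Lyon"}),         # 0.0
]
print(cluster_worst_case(scores))  # 0.0

Taking the minimum over a cluster of related questions, rather than the average, is what makes the metric sensitive to robustness: a model must handle every variation of the question constraints to score well.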
