Paper Title

How Well Do Multi-hop Reading Comprehension Models Understand Date Information?

Authors

Ho, Xanh, Sugawara, Saku, Aizawa, Akiko

Abstract

Several multi-hop reading comprehension datasets have been proposed to resolve the issue of reasoning shortcuts, by which questions can be answered without performing multi-hop reasoning. However, the ability of multi-hop models to perform step-by-step reasoning when finding an answer to a comparison question remains unclear. It is also unclear how questions about the internal reasoning process are useful for training and evaluating question-answering (QA) systems. To evaluate the model precisely in a hierarchical manner, we first propose a dataset, HieraDate, with three probing tasks in addition to the main question: extraction, reasoning, and robustness. Our dataset is created by enhancing two previous multi-hop datasets, HotpotQA and 2WikiMultiHopQA, focusing on multi-hop questions on date information that involve both comparison and numerical reasoning. We then evaluate the ability of existing models to understand date information. Our experimental results reveal that the multi-hop models do not have the ability to subtract two dates even when they perform well in date comparison and number subtraction tasks. Other results reveal that our probing questions can help to improve the performance of the models (e.g., by +10.3 F1) on the main QA task and that our dataset can be used for data augmentation to improve the robustness of the models.
