Paper Title
Towards Question Format Independent Numerical Reasoning: A Set of Prerequisite Tasks
Paper Authors
Paper Abstract
Numerical reasoning is often important for accurately understanding the world. Recently, several format-specific datasets have been proposed, such as datasets for numerical reasoning in the settings of Natural Language Inference (NLI), Reading Comprehension (RC), and Question Answering (QA). Several format-specific models and architectures have also been proposed in response to these datasets. However, there is a strong need for a benchmark that can evaluate the ability of models to perform question-format-independent numerical reasoning, since (i) the numerical reasoning capabilities we want to teach are not controlled by question format, and (ii) for numerical reasoning technology to have the widest possible application, it must be able to process language and reason in a way that is not exclusive to a single format, task, dataset, or domain. In pursuit of this goal, we introduce NUMBERGAME, a multifaceted benchmark to evaluate model performance across numerical reasoning tasks of eight diverse formats. Our compilation includes four existing question types; of the new types we add, two involve questions that require external numerical knowledge, commonsense knowledge, and domain knowledge. Toward building a more practical numerical reasoning system, NUMBERGAME demands four capabilities beyond numerical reasoning: (i) detecting the question format directly from the data, (ii) finding an intermediate common format to which every format can be converted, (iii) incorporating commonsense knowledge, and (iv) handling data imbalance across formats. We build several baselines, including a new model based on knowledge hunting using a cheatsheet. However, all baselines perform poorly compared to the human baseline, indicating the difficulty of our benchmark. Our work takes forward recent progress in generic system development, demonstrating the scope of these under-explored tasks.