Paper Title

A Theoretically Grounded Benchmark for Evaluating Machine Commonsense

Paper Authors

Henrique Santos, Ke Shen, Alice M. Mulvehill, Yasaman Razeghi, Deborah L. McGuinness, Mayank Kejriwal

Paper Abstract

Programming machines with commonsense reasoning (CSR) abilities is a longstanding challenge in the Artificial Intelligence community. Current CSR benchmarks use multiple-choice (and in relatively fewer cases, generative) question-answering instances to evaluate machine commonsense. Recent progress in transformer-based language representation models suggests that considerable progress has been made on existing benchmarks. However, although dozens of CSR benchmarks currently exist, and the number is growing, it is not evident that the full suite of commonsense capabilities has been systematically evaluated. Furthermore, there are doubts about whether language models are 'fitting' to a benchmark dataset's training partition by picking up on subtle, but normatively irrelevant (at least for CSR), statistical features to achieve good performance on the testing partition. To address these challenges, we propose a benchmark called Theoretically-Grounded Commonsense Reasoning (TG-CSR) that is also based on discriminative question answering, but with questions designed to evaluate diverse aspects of commonsense, such as space, time, and world states. TG-CSR is based on a subset of the commonsense categories first proposed as a viable theory of commonsense by Gordon and Hobbs. The benchmark is also designed to be few-shot (and in the future, zero-shot), with only a few training and validation examples provided. This report discusses the structure and construction of the benchmark. Preliminary results suggest that the benchmark is challenging even for advanced language representation models designed for discriminative CSR question-answering tasks.
Benchmark access and leaderboard: https://codalab.lisn.upsaclay.fr/competitions/3080
Benchmark website: https://usc-isi-i2.github.io/TGCSR/
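
To make the discriminative, few-shot task format concrete, below is a minimal sketch of what a multiple-choice commonsense QA instance and its accuracy-based scoring could look like. The data fields, the toy spatial question, and the `accuracy` helper are illustrative assumptions only; they do not reflect the actual TG-CSR release schema or evaluation harness.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MCQInstance:
    # Hypothetical structure for a discriminative (multiple-choice) CSR instance.
    # Field names are illustrative, not the actual TG-CSR format.
    context: str
    question: str
    candidates: List[str]
    answer_index: int   # gold label; withheld on a test partition
    category: str       # e.g., "space", "time", "world states"

def accuracy(predictions: List[int], instances: List[MCQInstance]) -> float:
    """Fraction of instances where the predicted candidate index matches the gold label."""
    correct = sum(int(p == inst.answer_index)
                  for p, inst in zip(predictions, instances))
    return correct / len(instances)

# Toy example: one spatial-commonsense instance and a trivial "always pick 0" baseline.
toy = [MCQInstance(
    context="A cup of coffee sits on a desk next to a laptop.",
    question="Which object is most likely directly supporting the cup?",
    candidates=["the desk", "the laptop", "the ceiling", "the coffee"],
    answer_index=0,
    category="space",
)]
print(accuracy([0], toy))  # 1.0
```

In the few-shot setting described in the abstract, a model would see only a handful of such labeled instances for training and validation, with gold labels withheld on the test partition.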
