Paper Title
Logically Consistent Adversarial Attacks for Soft Theorem Provers
Paper Authors
Paper Abstract
Recent efforts within the AI community have yielded impressive results towards "soft theorem proving" over natural language sentences using language models. We propose a novel, generative adversarial framework for probing and improving these models' reasoning capabilities. Adversarial attacks in this domain suffer from the logical inconsistency problem, whereby perturbations to the input may alter the label. Our Logically consistent AdVersarial Attacker, LAVA, addresses this by combining a structured generative process with a symbolic solver, guaranteeing logical consistency. Our framework successfully generates adversarial attacks and identifies global weaknesses common across multiple target models. Our analyses reveal naive heuristics and vulnerabilities in these models' reasoning capabilities, exposing an incomplete grasp of logical deduction under logic programs. Finally, in addition to effective probing of these models, we show that training on the generated samples improves the target model's performance.
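To make the consistency requirement concrete, the sketch below is our own illustration rather than the paper's code: it pairs a toy forward-chaining solver over propositional facts and definite rules with a check that accepts a candidate perturbation only if the query's entailment label is unchanged. All function names, the closed-world entailment assumption, and the example rule base are assumptions introduced here for illustration.

# Hypothetical sketch of a logical-consistency filter, in the spirit of
# combining a generated perturbation with a symbolic solver (not LAVA's
# actual implementation). Facts are atoms; rules are (body_atoms, head_atom).

def forward_chain(facts, rules):
    """Compute the closure of `facts` under the definite-clause `rules`."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in derived and all(b in derived for b in body):
                derived.add(head)
                changed = True
    return derived

def label(facts, rules, query):
    """True iff `query` is entailed under closed-world forward chaining."""
    return query in forward_chain(facts, rules)

def is_consistent_perturbation(facts, rules, query, new_facts, new_rules):
    """Accept a candidate attack only if it leaves the query's label unchanged."""
    return label(facts, rules, query) == label(new_facts, new_rules, query)

# Toy example: a perturbation that adds a distractor fact and an irrelevant
# rule must not flip the label of the query `green(bob)`.
facts = {"big(bob)", "blue(bob)"}
rules = [({"big(bob)", "blue(bob)"}, "green(bob)")]
query = "green(bob)"

new_facts = facts | {"red(alice)"}
new_rules = rules + [({"red(alice)"}, "kind(alice)")]

print(is_consistent_perturbation(facts, rules, query, new_facts, new_rules))  # True

In this toy setup, perturbations that would flip the query's truth value are simply rejected, which is one way to guarantee that adversarial inputs keep their original labels.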