ClarQ: A Large-Scale and Diverse Dataset for Clarification Question Generation

Paper Title

ClarQ: A Large-Scale and Diverse Dataset for Clarification Question Generation

Authors

Kumar, Vaibhav; Black, Alan W.

Abstract

Question answering and conversational systems are often baffled and need help clarifying certain ambiguities. However, limitations of existing datasets hinder the development of large-scale models capable of generating and utilising clarification questions. To overcome these limitations, we devise a novel bootstrapping framework (based on self-supervision) that assists in the creation of a diverse, large-scale dataset of clarification questions based on post-comment tuples extracted from Stack Exchange. The framework utilises a neural-network-based architecture for classifying clarification questions. It is a two-step method: the first step aims to increase the precision of the classifier, and the second aims to increase its recall. We quantitatively demonstrate the utility of the newly created dataset by applying it to the downstream task of question answering. The final dataset, ClarQ, consists of ~2M examples distributed across 173 Stack Exchange domains. We release this dataset to foster research into clarification question generation, with the larger goal of enhancing dialog and question answering systems.
