Paper Title
Sanitizing Synthetic Training Data Generation for Question Answering over Knowledge Graphs
Paper Authors
Abstract
Synthetic data generation is important to training and evaluating neural models for question answering over knowledge graphs. The quality of the data and the partitioning of the datasets into training, validation and test splits impact the performance of the models trained on this data. If the synthetic data generation depends on templates, as is the predominant approach for this task, there may be a leakage of information via a shared basis of templates across data splits if the partitioning is not performed hygienically. This paper investigates the extent of such information leakage across data splits, and the ability of trained models to generalize to test data when the leakage is controlled. We find that information leakage indeed occurs and that it affects performance. At the same time, the trained models do generalize to test data under the sanitized partitioning presented here. Importantly, these findings extend beyond the particular flavor of question answering task we studied and raise a series of difficult questions around template-based synthetic data generation that will necessitate additional research.
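To make the leakage concern concrete, the following is a minimal sketch of one way to partition template-generated examples so that no template contributes to more than one split. It is an illustration under stated assumptions, not the partitioning procedure of the paper: the `template_id` field, the `split_by_template` helper, and the 80/10/10 ratios are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_template(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole templates (not individual examples) to splits.

    Assumes each example is a dict with a hypothetical "template_id" key
    identifying the template it was generated from.
    """
    # Group examples by the template that produced them.
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["template_id"]].append(ex)

    # Shuffle template ids deterministically, then cut the list by ratio.
    template_ids = sorted(groups)
    random.Random(seed).shuffle(template_ids)
    n_train = int(len(template_ids) * ratios[0])
    n_val = int(len(template_ids) * ratios[1])
    assignment = {
        "train": template_ids[:n_train],
        "validation": template_ids[n_train:n_train + n_val],
        "test": template_ids[n_train + n_val:],
    }

    # Expand each split's templates back into their examples.
    return {
        name: [ex for t in ids for ex in groups[t]]
        for name, ids in assignment.items()
    }


# Toy data: 100 question/query pairs generated from 10 hypothetical templates.
examples = [
    {"template_id": i % 10, "question": f"q{i}", "query": f"sparql{i}"}
    for i in range(100)
]
splits = split_by_template(examples)
```

A random split over individual examples would, by contrast, almost always place instances of the same template in both training and test data, which is the kind of shared basis the abstract describes.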