SynSciPass: detecting appropriate uses of scientific text generation

Paper title

SynSciPass: detecting appropriate uses of scientific text generation

Paper author

Rosati, Domenic

Abstract

Approaches to machine-generated text detection tend to focus on binary classification of human- versus machine-written text. In the scientific domain, where publishers might use these models to examine manuscripts under submission, misclassification has the potential to cause harm to authors. Additionally, authors may appropriately use text generation models, for example through assistive technologies like translation tools. In this setting, a binary classification scheme might flag appropriate uses of assistive text generation technology as simply machine generated, which is a cause of concern. In our work, we simulate this scenario by presenting a state-of-the-art detector trained on DAGPap22 with machine-translated passages from Scielo and find that the model performs at random. Given this finding, we develop a framework for dataset development that provides a nuanced approach to detecting machine-generated text by having labels for the type of technology used, such as translation or paraphrase, resulting in the construction of SynSciPass. By training the same model that performed well on DAGPap22 on SynSciPass, we show that the model is not only more robust to domain shifts but also able to uncover the type of technology used for machine-generated text. Despite this, we conclude that current datasets are neither comprehensive nor realistic enough to understand how these models would perform in the wild, where manuscript submissions can come from many unknown or novel distributions; how they would perform on scientific full texts rather than small passages; and what might happen when there is a mix of appropriate and inappropriate uses of natural language generation.
