Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Paper title

Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Paper authors

Soyoung Yoon, Sungjoon Park, Gyuwan Kim, Junhee Cho, Kihyo Park, Gyutae Kim, Minjoon Seo, Alice Oh

Paper abstract

Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English. We attribute this problematic circumstance to the lack of a carefully designed evaluation benchmark for Korean GEC. In this work, we collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) that cover a wide range of Korean grammatical errors. Considering the nature of Korean grammar, we then define 14 error types for Korean and provide KAGAS (Korean Automatic Grammatical error Annotation System), which can automatically annotate error types from parallel corpora. We use KAGAS on our datasets to make an evaluation benchmark for Korean, and present baseline models trained from our datasets. We show that the model trained with our datasets significantly outperforms the currently used statistical Korean GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets. The implementations and datasets are open-sourced.
