Paper title

The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild

Authors

Kuzman, Taja, Rupnik, Peter, Ljubešić, Nikola

Abstract

This paper presents GINCO, a new training dataset for automatic genre identification, based on 1,125 crawled Slovenian web documents comprising 650 thousand words. Each document was manually annotated for genre with a new annotation schema that builds upon existing schemata and was designed primarily with clarity of labels and inter-annotator agreement in mind. The dataset includes various challenges typical of web-based data, such as machine-translated content, encoding errors, and multiple contents presented in one document, enabling evaluation of classifiers in realistic conditions. The initial machine learning experiments on the dataset show that (1) pre-Transformer models are drastically less able to model the phenomena, with macro F1 scores of around 0.22, while Transformer-based models achieve scores of around 0.58, and (2) multilingual Transformer models work as well on the task as the monolingual models that were previously shown to be superior to multilingual models on standard NLP tasks.
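
The abstract reports results as macro F1, which is the unweighted mean of the per-class F1 scores, so rare genres count as much as frequent ones. The following is a minimal sketch of that metric in plain Python. It is not the authors' evaluation code, and the function name `macro_f1` and the genre labels in the example are hypothetical placeholders, not taken from the GINCO schema.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only
gold = ["News", "News", "Forum", "Opinion"]
pred = ["News", "Forum", "Forum", "News"]
print(round(macro_f1(gold, pred), 3))
```

Because every class contributes equally to the average, a classifier that ignores infrequent genres is penalized heavily, which is one reason the measure suits a dataset with an uneven genre distribution.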
