Paper Title
Yunshan Cup 2020: Overview of the Part-of-Speech Tagging Task for Low-resourced Languages
Paper Authors
Paper Abstract
The Yunshan Cup 2020 track focused on creating a framework for evaluating different methods of part-of-speech (POS) tagging. The track had two tasks: (1) POS tagging for the Indonesian language, and (2) POS tagging for the Lao language. The Indonesian dataset consists of 10,000 sentences from Indonesian news, annotated with 29 tags. The Lao dataset consists of 8,000 sentences, annotated with 27 tags. Twenty-five teams registered for the task. Participants' methods ranged from feature-based approaches to neural networks, using either classical machine learning techniques or ensemble methods. The best-performing systems achieved an accuracy of 95.82% for Indonesian and 93.03% for Lao, showing that neural sequence labeling models significantly outperform classic feature-based and rule-based methods.