SimRelUz: Similarity and Relatedness Scores as a Semantic Evaluation Dataset for the Uzbek Language

Paper title

SimRelUz: Similarity and Relatedness scores as a Semantic Evaluation dataset for Uzbek language

Authors

Ulugbek Salaev, Elmurod Kuriyozov, Carlos Gómez-Rodríguez

Abstract

Semantic relatedness between words is one of the core concepts in natural language processing, which makes semantic evaluation an important task. In this paper, we present a semantic model evaluation dataset: SimRelUz, a collection of similarity and relatedness scores of word pairs for the low-resource Uzbek language. The dataset consists of more than a thousand word pairs, carefully selected based on their morphological features, occurrence frequency, and semantic relation, and annotated by eleven native Uzbek speakers of different age groups and genders. We also paid attention to the problem of dealing with rare words and out-of-vocabulary words in order to thoroughly evaluate the robustness of semantic models.
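
As a rough illustration of how a dataset of this kind is typically used, the sketch below scores a word-embedding model against human-annotated word pairs with Spearman rank correlation and counts the pairs skipped because a word is out of vocabulary. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names (`cosine`, `ranks`, `spearman`, `evaluate`), the placeholder words, the scores, and the two-dimensional vectors are all hypothetical and are not taken from SimRelUz.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(pairs, vectors):
    """Return (Spearman rho, number of pairs skipped as out of vocabulary)."""
    gold, predicted, skipped = [], [], 0
    for w1, w2, score in pairs:
        if w1 not in vectors or w2 not in vectors:
            skipped += 1  # out-of-vocabulary pair: counted, not scored
            continue
        gold.append(score)
        predicted.append(cosine(vectors[w1], vectors[w2]))
    rho = spearman(gold, predicted) if len(gold) > 1 else float("nan")
    return rho, skipped


# Hypothetical toy data: placeholder words and made-up human scores.
pairs = [("a", "b", 9.0), ("a", "c", 2.0), ("b", "c", 3.0), ("a", "zzz", 5.0)]
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}

rho, oov = evaluate(pairs, vectors)
print(f"Spearman rho = {rho:.3f}, skipped OOV pairs = {oov}")
```

Skipping out-of-vocabulary pairs is only one possible policy. Reporting the skipped count next to the correlation keeps models with different vocabulary coverage comparable, since a model that covers few pairs would otherwise be scored on an easier subset.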
