Paper Title

Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity

Paper Authors

Vulić, Ivan, Baker, Simon, Ponti, Edoardo Maria, Petti, Ulla, Leviant, Ira, Wing, Kelly, Majewska, Olga, Bar, Eden, Malone, Matt, Poibeau, Thierry, Reichart, Roi, Korhonen, Anna

Abstract

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering datasets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets. Due to its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and cross-lingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and cross-lingual representation models, including static and contextualized word embeddings (such as fastText, M-BERT and XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised cross-lingual word embeddings. We also present a step-by-step dataset creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions -- the public release of Multi-SimLex datasets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning -- available via a website which will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
