Paper Title
Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity
Paper Authors
Paper Abstract
We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering datasets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets. Due to its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and cross-lingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and cross-lingual representation models, including static and contextualized word embeddings (such as fastText, M-BERT, and XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised cross-lingual word embeddings. We also present a step-by-step protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions -- the public release of Multi-SimLex datasets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning -- available via a website which will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
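Evaluation on benchmarks of this kind typically scores each concept pair with the cosine similarity of its word vectors and then measures the rank correlation (Spearman's rho) against the human similarity ratings. The sketch below is a minimal illustration of that procedure; the `vectors` dictionary and `pairs` list are hypothetical toy placeholders standing in for pretrained embeddings (e.g., fastText) and a Multi-SimLex-style dataset, not the actual resource.

```python
# Minimal sketch of the standard intrinsic evaluation: cosine similarity of
# word vectors vs. human similarity ratings, scored with Spearman's rho.
# The vectors and pairs below are toy placeholders, not Multi-SimLex data.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical static word vectors (in practice: fastText, M-BERT, XLM, ...).
vectors = {
    "car":     np.array([0.90, 0.10, 0.05]),
    "auto":    np.array([0.85, 0.15, 0.10]),
    "bicycle": np.array([0.40, 0.70, 0.20]),
    "happy":   np.array([0.05, 0.20, 0.95]),
    "glad":    np.array([0.10, 0.25, 0.90]),
}

# Hypothetical concept pairs with toy human similarity ratings.
pairs = [
    ("car", "auto", 5.8),
    ("car", "bicycle", 2.1),
    ("happy", "glad", 5.5),
    ("car", "happy", 0.3),
]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in pairs]
human_scores = [rating for _, _, rating in pairs]

rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")
```

The same scoring loop applies to the cross-lingual datasets, with the two words of each pair drawn from different languages and embedded in a shared (aligned) vector space.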