Paper title
Compass-aligned Distributional Embeddings for Studying Semantic Differences across Corpora
Paper authors
Paper abstract
Word2vec is one of the most widely used algorithms for generating word embeddings because it combines efficiency, high-quality representations, and cognitive grounding. However, word meaning is not static: it depends on the context in which words are used. Differences in word meaning that depend on time, location, topic, and other factors can be studied by analyzing embeddings generated from different corpora in collections that are representative of these factors. For example, language evolution can be studied using a collection of news articles published in different time periods. In this paper, we present a general framework to support cross-corpora language studies with word embeddings, in which embeddings generated from different corpora can be compared to find correspondences and differences in meaning across the corpora. CADE is the core component of our framework and solves the key problem of aligning the embeddings generated from different corpora. In particular, we focus on providing solid evidence of the effectiveness, generality, and robustness of CADE. To this end, we conduct quantitative and qualitative experiments in different domains, from temporal word embeddings to language localization and topical analysis. The results of our experiments suggest that CADE achieves state-of-the-art or superior performance on tasks where several competing approaches are available, while providing a general method that can be used in a variety of domains. Finally, our experiments shed light on the conditions under which the alignment is reliable, which depend substantially on the degree of cross-corpora vocabulary overlap.
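As a rough illustration of the kind of comparison the abstract describes, the sketch below measures how far a word's vector moves between two embedding spaces and how much the two vocabularies overlap. It does not implement CADE: it assumes the two spaces have already been aligned by some method, and the function names, the toy vectors, and the use of Jaccard similarity as the overlap measure are all hypothetical choices made for this example, not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def semantic_shift(word, emb_a, emb_b):
    """Cosine distance of `word` between two ALREADY ALIGNED spaces.

    Assumption: emb_a and emb_b map words to vectors in a shared
    coordinate system. Without alignment this number is meaningless.
    """
    return 1.0 - cosine_similarity(emb_a[word], emb_b[word])


def vocabulary_overlap(vocab_a, vocab_b):
    """Jaccard overlap of two vocabularies (an assumed overlap measure)."""
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / len(a | b)


# Toy, hand-made vectors standing in for embeddings of two corpora.
emb_old = {"apple": [1.0, 0.0], "bank": [0.0, 1.0]}
emb_new = {"apple": [0.0, 1.0], "bank": [0.0, 2.0], "tweet": [1.0, 1.0]}

shared = sorted(set(emb_old) & set(emb_new))
shifts = {w: semantic_shift(w, emb_old, emb_new) for w in shared}
overlap = vocabulary_overlap(emb_old, emb_new)
```

In this toy data, "apple" points in an orthogonal direction in the second space (shift 1.0) while "bank" keeps its direction (shift 0.0), and two of the three distinct words are shared. Cosine distance is used because it ignores vector length, which tends to vary with word frequency across corpora.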