Speakers Fill Lexical Semantic Gaps with Context

Paper title

Speakers Fill Lexical Semantic Gaps with Context

Paper authors

Tiago Pimentel, Rowan Hall Maudslay, Damián Blasi, Ryan Cotterell

Paper abstract

Lexical ambiguity is widespread in language, allowing for the reuse of economical word forms and therefore making language more efficient. If ambiguous words cannot be disambiguated from context, however, this gain in efficiency might make language less clear -- resulting in frequent miscommunication. For a language to be clear and efficiently encoded, we posit that the lexical ambiguity of a word type should correlate with how much information context provides about it, on average. To investigate whether this is the case, we operationalise the lexical ambiguity of a word as the entropy of meanings it can take, and provide two ways to estimate this -- one which requires human annotation (using WordNet), and one which does not (using BERT), making it readily applicable to a large number of languages. We validate these measures by showing that, on six high-resource languages, there are significant Pearson correlations between our BERT-based estimate of ambiguity and the number of synonyms a word has in WordNet (e.g. $\rho = 0.40$ in English). We then test our main hypothesis -- that a word's lexical ambiguity should negatively correlate with its contextual uncertainty -- and find significant correlations on all 18 typologically diverse languages we analyse. This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
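
The two quantities the abstract relies on, the entropy of a word's meanings and a Pearson correlation across words, can be illustrated with a minimal sketch. It does not reproduce the paper's WordNet or BERT estimators. It assumes, purely for illustration, that each occurrence of a word already carries a sense label; the function names `sense_entropy` and `pearson` are hypothetical.

```python
import math
from collections import Counter


def sense_entropy(sense_labels):
    """Shannon entropy (in bits) of the empirical distribution over senses.

    `sense_labels` is a hypothetical list with one sense label per
    occurrence of a single word type.
    """
    counts = Counter(sense_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Toy data (invented for illustration, not taken from the paper):
# a word used in two senses equally often is more ambiguous than a
# word that is always used in the same sense.
ambiguous = sense_entropy(["river", "finance", "river", "finance"])
unambiguous = sense_entropy(["fruit", "fruit", "fruit"])
```

Under this reading, a word with a single sense has zero entropy, and a word whose senses are equally frequent has the maximum entropy for that number of senses. The hypothesis in the abstract would then be tested by correlating such per-word ambiguity scores with a per-word estimate of contextual uncertainty.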
