Paper title
Language Modeling with Reduced Densities
Paper authors
Paper abstract
This work originates from the observation that today's state-of-the-art statistical language models are impressive not only for their performance but also, and quite crucially, because they are built entirely from correlations in unstructured text data. The latter observation prompts a fundamental question that lies at the heart of this paper: What mathematical structure exists in unstructured text data? We put forth enriched category theory as a natural answer. We show that sequences of symbols from a finite alphabet, such as those found in a corpus of text, form a category enriched over probabilities. We then address a second fundamental question: How can this information be stored and modeled in a way that preserves the categorical structure? We answer this by constructing a functor from our enriched category of text to a particular enriched category of reduced density operators. The latter leverages the Loewner order on positive semidefinite operators, which can further be interpreted as a toy example of entailment.
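To make the phrase "a category enriched over probabilities" concrete, here is a minimal sketch in Python. It is not the paper's construction: the abstract does not spell out the hom-objects, so the sketch assumes that objects are word sequences occurring as prefixes in a toy corpus, that hom(x, y) is the empirical probability that a text beginning with x continues to y (and 0 when y does not extend x), and that composition is multiplication in [0, 1]. The corpus and the names `CORPUS`, `count`, `hom`, `objects`, and `is_enriched` are all hypothetical.

```python
from fractions import Fraction

# Toy corpus (hypothetical data, chosen only for illustration).
CORPUS = ["red apple", "red rose", "red apple pie", "green apple"]


def count(words):
    """Number of corpus entries that begin with the word tuple `words`."""
    n = len(words)
    return sum(1 for s in CORPUS if tuple(s.split()[:n]) == words)


def objects():
    """All word-tuple prefixes of corpus entries, including the empty tuple."""
    prefixes = set()
    for s in CORPUS:
        tokens = tuple(s.split())
        for i in range(len(tokens) + 1):
            prefixes.add(tokens[:i])
    return sorted(prefixes)


def hom(x, y):
    """Assumed hom-object: empirical probability that a text starting
    with x continues to y, and 0 when y does not extend x."""
    if y[:len(x)] != x or count(x) == 0:
        return Fraction(0)
    return Fraction(count(y), count(x))


def is_enriched():
    """Check the two enriched-category conditions over ([0, 1], *, <=):
    identities (1 <= hom(x, x)) and composition
    (hom(x, y) * hom(y, z) <= hom(x, z))."""
    objs = objects()
    identities = all(hom(x, x) >= 1 for x in objs)
    composition = all(
        hom(x, y) * hom(y, z) <= hom(x, z)
        for x in objs for y in objs for z in objs
    )
    return identities and composition


if __name__ == "__main__":
    print(hom(("red",), ("red", "apple")))  # 2 of the 3 "red ..." entries
    print(is_enriched())
```

Under these assumptions the composition inequality is an equality whenever x is a prefix of y and y is a prefix of z, because the counts telescope; in every other case the left-hand side is 0.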