Paper Title

Finding Structural Knowledge in Multimodal-BERT

Paper Authors

Milewski, Victor, de Lhoneux, Miryam, Moens, Marie-Francine

Paper Abstract

In this work, we investigate the knowledge learned in the embeddings of multimodal-BERT models. More specifically, we probe their capabilities of storing the grammatical structure of linguistic data and the structure learned over objects in visual data. To reach that goal, we first make the inherent structure of language and visuals explicit through a dependency parse of the sentences that describe the image and through the dependencies between the object regions in the image, respectively. We call this explicit visual structure the "scene tree"; it is based on the dependency tree of the language description. Extensive probing experiments show that the multimodal-BERT models do not encode these scene trees. Code is available at https://github.com/VSJMilewski/multimodal-probes.
