Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge

Paper Title

Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge

Authors

Chen, Zhihong; Li, Guanbin; Wan, Xiang

Abstract

Medical vision-and-language pre-training (Med-VLP) has received considerable attention owing to its applicability to extracting generic vision-and-language representations from medical images and texts. Most existing methods mainly contain three elements: uni-modal encoders (i.e., a vision encoder and a language encoder), a multi-modal fusion module, and pretext tasks, with few studies considering the importance of medical domain expert knowledge and explicitly exploiting such knowledge to facilitate Med-VLP. Although there exist knowledge-enhanced vision-and-language pre-training (VLP) methods in the general domain, most require off-the-shelf toolkits (e.g., object detectors and scene graph parsers), which are unavailable in the medical domain. In this paper, we propose a systematic and effective approach to enhance Med-VLP by structured medical knowledge from three perspectives. First, considering knowledge can be regarded as the intermediate medium between vision and language, we align the representations of the vision encoder and the language encoder through knowledge. Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning using knowledge as the supplementation of the input image and text. Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks. To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results illustrate the effectiveness of our approach, where state-of-the-art performance is achieved on all downstream tasks. Further analyses explore the effects of different components of our approach and various settings of pre-training.
